HIGH THROUGHPUT SCREENING FOR SEQUENCES OF INTEREST

CROSS REFERENCE TO RELATED APPLICATIONS 

This application is a continuation-in-part of U.S. Patent Application Serial No. 
09/444,112, filed November 22, 1999 (pending), which is a continuation-in-part of
U.S. Patent Application Serial No. 09/098,206, filed June 16, 1998 (pending), which
application is a continuation-in-part of U.S. Patent Application Serial No. 08/876,276, 
filed June 16, 1997 (pending), the contents of which are all incorporated by reference in 
their entirety herein. 

FIELD OF THE INVENTION 

The present invention relates generally to screening of mixed populations of
organisms or nucleic acids and more specifically to the identification of bioactive
molecules and bioactivities by using high throughput screening techniques, including 
fluorescence activated cell sorting (FACS). 

BACKGROUND 

There is a critical need in the chemical industry for efficient catalysts for the
practical synthesis of optically pure materials; enzymes can provide the optimal
solution. All classes of molecules and compounds that are utilized in both established
and emerging chemical, pharmaceutical, textile, food and feed, and detergent markets
must meet stringent economic and environmental standards. The synthesis of
polymers, pharmaceuticals, natural products and agrochemicals is often hampered by
expensive processes which produce harmful byproducts and which suffer from low
enantioselectivity (Faber, 1995; Tonkovich and Gerber, U.S. Dept of Energy study,
1995). Enzymes have a number of remarkable advantages which can overcome these
problems in catalysis: they act on single functional groups, they distinguish between
similar functional groups on a single molecule, and they distinguish between
enantiomers. Moreover, they are biodegradable and function at very low mole
fractions in reaction mixtures. Because of their chemo-, regio- and stereospecificity,
enzymes present a unique opportunity to optimally achieve desired selective
transformations. These are often extremely difficult to duplicate chemically,
especially in single-step reactions. The elimination of the need for protecting groups,
selectivity, the ability to carry out multi-step transformations in a single reaction
vessel, and the concomitant reduction in environmental burden have led to the
increased demand for enzymes in chemical and pharmaceutical industries (Faber,
1995). Enzyme-based processes have been gradually replacing many conventional
chemical-based methods (Wrotnowski, 1997). More widespread industrial use is
currently limited primarily by the relatively small number of
commercially available enzymes. Only ~300 enzymes (excluding DNA modifying
enzymes) are at present commercially available from the >3000 non-DNA-modifying
enzyme activities thus far described.

The use of enzymes for technological applications also may require
performance under demanding industrial conditions. This includes activities in
environments or on substrates for which the currently known arsenal of enzymes was
not evolutionarily selected. Enzymes have evolved by selective pressure to perform
very specific biological functions within the milieu of a living organism, under
conditions of mild temperature, pH and salt concentration. For the most part, the
non-DNA modifying enzyme activities thus far described (Enzyme Nomenclature, 1992)
have been isolated from mesophilic organisms, which represent a very small fraction
of the available phylogenetic diversity (Amann et al., 1995). The dynamic field of
biocatalysis takes on a new dimension with the help of enzymes isolated from
microorganisms that thrive in extreme environments. Such enzymes must function at
temperatures above 100 °C in terrestrial hot springs and deep sea thermal vents, at
temperatures below 0 °C in arctic waters, in the saturated salt environment of the
Dead Sea, at pH values around 0 in coal deposits and geothermal sulfur-rich springs,
or at pH values greater than 11 in sewage sludge (Adams and Kelly, 1995). Enzymes
obtained from these extremophilic organisms open a new field in biocatalysis.

For example, several esterases and lipases cloned and expressed from
extremophilic organisms are remarkably robust, showing high activity throughout a
wide range of temperatures and pHs. The fingerprints of five of these esterases show
a diverse substrate spectrum, in addition to differences in the optimum reaction
temperature. As seen in Figure 1, esterase #5 recognizes only short chain substrates
while #2 acts only on long chain substrates, and the two differ greatly in
optimal reaction temperature. These results suggest that more diverse enzymes
fulfilling the need for new biocatalysts can be found by screening biodiversity.
Substrates upon which enzymes act are herein defined as bioactive substrates.

Furthermore, virtually all of the enzymes known so far have come from
cultured organisms, mostly bacteria and more recently archaea (Enzyme
Nomenclature, 1992). Traditional enzyme discovery programs rely solely on cultured
microorganisms for their screening programs and are thus only accessing a small
fraction of natural diversity. Several recent studies have estimated that only a small
percentage, conservatively less than 1%, of organisms present in the natural
environment have been cultured (see Table I; Amann et al., 1995; Barns et al., 1994;
Torsvik et al., 1990). For example, Norman Pace's laboratory recently reported intensive
untapped diversity in water and sediment samples from the "Obsidian Pool" in
Yellowstone National Park, a spring which has been studied since the early 1960s
by microbiologists (Barns, 1994). Amplification and cloning of 16S rRNA encoding
sequences revealed mostly unique sequences with little or no representation of the
organisms which had previously been cultured from this pool. This suggests
substantial diversity of archaea with so far unknown morphological, physiological and
biochemical features which may be useful in industrial processes. David Ward's
laboratory in Bozeman, Montana has performed similar studies on the cyanobacterial
mat of Octopus Spring in Yellowstone Park and came to the same conclusion, namely,
that tremendous uncultured diversity exists (Bateson et al., 1989). Giovannoni et al.
(1990) reported similar results using bacterioplankton collected in the Sargasso Sea,
while Torsvik et al. (1990) have shown by DNA reassociation kinetics that there is
considerable diversity in soil samples. Hence, this vast majority of microorganisms
represents an untapped resource for the discovery of novel biocatalysts. In order to
access this potential catalytic diversity, recombinant screening approaches are
required.

The discovery of novel bioactive molecules other than enzymes is also
afforded by the present invention. For instance, antibiotics, antivirals, antitumor
agents and regulatory proteins can be discovered utilizing the present invention.

Bacteria and many eukaryotes have a coordinated mechanism for regulating
genes whose products are involved in related processes. The genes are clustered, in
structures referred to as "gene clusters," on a single chromosome and are transcribed
together under the control of a single regulatory sequence, including a single promoter
which initiates transcription of the entire cluster. The gene cluster, the promoter, and
additional sequences that function in regulation altogether are referred to as an
"operon" and can include up to 20 or more genes, usually from 2 to 6 genes. Thus, a
gene cluster is a group of adjacent genes that are either identical or related, usually as
to their function.

Some gene families consist of one or more identical members. Clustering is a
prerequisite for maintaining identity between genes, although clustered genes are not
necessarily identical. Gene clusters range from extremes where a duplication is
generated of adjacent related genes to cases where hundreds of identical genes lie in a
tandem array. Sometimes no significance is discernible in a repetition of a particular
gene. A principal example of this is the expressed duplicate insulin genes in some
species, whereas a single insulin gene is adequate in other mammalian species.

It is important to further research gene clusters and the extent to which the
full length of the cluster is necessary for the expression of the proteins resulting
therefrom. Gene clusters undergo continual reorganization and, thus, the ability to
create heterogeneous libraries of gene clusters from, for example, bacterial or other
prokaryote sources is valuable in determining sources of novel proteins, particularly
including enzymes such as, for example, the polyketide synthases that are responsible
for the synthesis of polyketides having a vast array of useful activities. As indicated,
other types of proteins that are the product(s) of gene clusters are also contemplated,
including, for example, antibiotics, antivirals, antitumor agents and regulatory
proteins, such as insulin.

Polyketides are molecules which are an extremely rich source of bioactivities,
including antibiotics (such as tetracyclines and erythromycin), anti-cancer agents
(daunomycin), immunosuppressants (FK506 and rapamycin), and veterinary products
(monensin). Many polyketides (produced by polyketide synthases) are valuable as
therapeutic agents. Polyketide synthases are multifunctional enzymes that catalyze
the biosynthesis of a huge variety of carbon chains differing in length and patterns of
functionality and cyclization. Polyketide synthase genes fall into gene clusters and at
least one type (designated type I) of polyketide synthase has large genes and
encoded enzymes, complicating genetic manipulation and in vitro studies of these
genes/proteins. The method(s) of the present invention facilitate the rapid discovery
of these gene clusters in gene expression libraries.

Of particular interest are cellular "switches" known as receptors which
interact with a variety of biomolecules, such as hormones, growth factors, and
neurotransmitters, to mediate the transduction of an "external" cellular signaling event
into an "internal" cellular signal. External signaling events include the binding of a
ligand to the receptor, and internal events include the modulation of a pathway in the
cytoplasm or nucleus involved in the growth, metabolism or apoptosis of the cell.
Internal events also include the inhibition or activation of transcription of certain
nucleic acid sequences, resulting in the increase or decrease in the production or
presence of certain molecules (such as nucleic acid, proteins, and/or other molecules
affected by this increase or decrease in transcription). Drugs to cure disease or
alleviate its symptoms can activate or block any of these events to achieve a desired
pharmaceutical effect.

Transduction can be accomplished by a transducing protein in the cell
membrane which is activated upon an allosteric change the receptor may undergo
upon binding to a specific biomolecule. The "active" transducing protein activates
production of so-called "second messenger" molecules within the cell, which then
activate certain regulatory proteins within the cell that regulate gene expression or
alter some metabolic process. Variations on the theme of this "cascade" of events
occur. For example, a receptor may act as its own transducing protein, or a
transducing protein may act directly on an intracellular target without mediation by a
second messenger.



Signal transduction is a fundamental area of inquiry in biology. For instance,
ligand/receptor interactions and the receptor/effector coupling mediated by guanine
nucleotide-binding proteins (G proteins) are of interest in the study of disease. A
large number of G protein-linked receptors funnel extracellular signals as diverse as
hormones, growth factors, neurotransmitters, primary sensory stimuli, and other
signals through a set of G proteins to a small number of second-messenger systems.
The G proteins act as molecular switches with an "on" and "off" state governed by a
GTPase cycle. Mutations in G proteins may result in either constitutive activation or
loss of expression.

Many receptors convey messages through heterotrimeric G proteins, of which
at least 17 distinct forms have been isolated. Additionally, there are several different
G protein-dependent effectors. The signals transduced through the heterotrimeric G 
proteins in mammalian cells influence intracellular events through the action of 
effector molecules. 

Given the variety of functions subserved by G protein-coupled signal
transduction, it is not surprising that abnormalities in G protein-coupled pathways can
lead to diseases with manifestations as dissimilar as blindness, hormone resistance,
precocious puberty and neoplasia. G protein-coupled receptors are extremely
important to drug research efforts. It is estimated that up to 60% of today's
prescription drugs work by somehow interacting with G protein-coupled receptors.
However, these drugs were developed using classical medicinal chemistry and
without a knowledge of the molecular mechanism of action. A more efficient drug
discovery program could be deployed by targeting individual receptors and making
use of information on gene sequence and biological function to develop effective
therapeutics. The present invention allows one to, for example, study molecules
which affect the interaction of G proteins with receptors, or of ligands with receptors.

Several groups have reported cells which express mammalian G proteins or
subunits thereof, along with mammalian receptors which interact with these
molecules. For example, WO 92/05244 (April 2, 1992) describes a transformed yeast
cell which is incapable of producing a yeast G protein subunit, but which has been
engineered to produce both a mammalian G protein subunit and a mammalian
receptor which interacts with the subunit. The authors found that a modified version
of a specific mammalian receptor integrated into the membrane of the cell, as shown
by studies of the ability of isolated membranes to interact properly with various
known agonists and antagonists of the receptor. Ligand binding resulted in G
protein-mediated signal transduction.

Another group has described the functional expression of a mammalian
adenylyl cyclase in yeast, and the use of the engineered yeast cells in identifying
potential inhibitors or activators of the mammalian adenylyl cyclase (WO 95/30012).
Adenylyl cyclase is among the best studied of the effector molecules which function
in mammalian cells in response to activated G proteins. "Activators" of adenylyl
cyclase cause the enzyme to become more active, elevating the cAMP signal of the
yeast cell to a detectable degree. "Inhibitors" cause the cyclase to become less active,
reducing the cAMP signal to a detectable degree. The method describes the use of the
engineered yeast cells to screen for drugs which activate or inhibit adenylyl cyclase
by their action on G protein-coupled receptors.

When attempting to identify genes encoding bioactivities of interest from
complex environmental expression libraries, the rate limiting steps in discovery occur
at both the DNA cloning level and the screening level. Screening of complex
environmental libraries which contain, for example, hundreds of different organisms
requires the analysis of several million clones to cover this genomic diversity. An
extremely high-throughput screening method has been developed to handle the
enormous numbers of clones present in these libraries.
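One common way to make the "several million clones" figure concrete is the Clarke-Carbon estimate, N = ln(1 - P) / ln(1 - f), where f is the fraction of the pooled genomic DNA carried by a single insert and P is the desired probability that any given sequence is present in the library. The sketch below is illustrative only. The function name and the example genome size, insert size, and organism count are assumptions, not values taken from this document, and the estimate assumes random cloning and equal abundance of all genomes in the sample.

```python
import math


def clones_needed(genome_bp, insert_bp, n_genomes=1, p=0.99):
    """Clarke-Carbon estimate of the library size needed so that any given
    sequence is represented with probability p.

    Hypothetical helper: assumes random cloning and that all n_genomes
    genomes are the same size and equally abundant in the sample.
    """
    # Fraction of the pooled genomic DNA carried by one insert.
    f = insert_bp / (genome_bp * n_genomes)
    if not 0.0 < f < 1.0:
        raise ValueError("insert must be smaller than the pooled genome")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - f))


if __name__ == "__main__":
    # Assumed example: 500 organisms, 4 Mb genomes, 5 kb inserts.
    n = clones_needed(genome_bp=4_000_000, insert_bp=5_000, n_genomes=500)
    print(f"clones needed for 99% coverage: {n:,}")
```

Under these assumed inputs the estimate comes out just under two million clones, which is the same order of magnitude as the figure given above. Unequal abundance of organisms in a real environmental sample would push the required number higher.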

In traditional flow cytometry, it is common to analyze very large numbers of
eukaryotic cells in a short period of time. Newly developed flow cytometers can
analyze and sort up to 20,000 cells per second. In a typical flow cytometer, individual
particles pass through an illumination zone and appropriate detectors, gated
electronically, measure the magnitude of a pulse representing the extent of light
scattered. The magnitudes of these pulses are sorted electronically into "bins" or
"channels", permitting the display of histograms of the number of cells possessing a
certain quantitative property versus the channel number (Davey and Kell, 1996). It
was recognized early on that the data accruing from flow cytometric measurements
could be analyzed (electronically) rapidly enough that electronic cell-sorting
procedures could be used to sort cells with desired properties into separate "buckets",
a procedure usually known as fluorescence-activated cell sorting (Davey and Kell,
1996).
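The binning and gating described above can be sketched as follows. This is a minimal illustration rather than instrument code: the function names, the 1024-channel resolution, and the 0 to 10 full-scale pulse range are assumptions chosen for the example, not parameters stated in this document.

```python
from collections import Counter


def to_channel(pulse, full_scale=10.0, n_channels=1024):
    """Map a pulse magnitude (arbitrary units) to a channel index.

    Pulses outside 0..full_scale are clipped to the first or last channel.
    """
    frac = min(max(pulse / full_scale, 0.0), 1.0)
    return min(int(frac * n_channels), n_channels - 1)


def channel_histogram(pulses, full_scale=10.0, n_channels=1024):
    """Count events per channel: the histogram of cell number vs. channel."""
    return Counter(to_channel(p, full_scale, n_channels) for p in pulses)


def sort_gate(pulses, low_channel, high_channel,
              full_scale=10.0, n_channels=1024):
    """Return the indices of events whose channel lies inside the gate.

    These are the events an electronic sorter would deflect into the
    collection "bucket".
    """
    return [
        i for i, p in enumerate(pulses)
        if low_channel <= to_channel(p, full_scale, n_channels) <= high_channel
    ]


if __name__ == "__main__":
    pulses = [0.0, 2.5, 5.0, 9.99, 12.0]   # made-up pulse magnitudes
    print(channel_histogram(pulses))
    print(sort_gate(pulses, low_channel=500, high_channel=1023))
```

In a real instrument the gate is usually drawn on two or more parameters at once (for example scatter and fluorescence); the single-parameter gate here keeps the sketch short.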

Fluorescence-activated cell sorting has been primarily used in studies of
human and animal cell lines and the control of cell culture processes. Fluorophore
labeling of cells and measurement of the fluorescence can give quantitative data about
specific target molecules or subcellular components and their distribution in the cell
population. Flow cytometry can quantitate virtually any cell-associated property or
cell organelle for which there is a fluorescent probe (or natural fluorescence). The
parameters which can be measured have previously been of particular interest in
animal cell culture.

Flow cytometry has also been used in cloning and selection of variants from
existing cell clones. This selection, however, has required stains that diffuse through
cells passively, rapidly and irreversibly, with no toxic effects or other influences on
metabolic or physiological processes. Since, typically, flow sorting has been used to
study animal cell culture performance, physiological state of cells, and the cell cycle,
one goal of cell sorting has been to keep the cells viable during and after sorting.

There currently are no reports in the literature of screening and discovery of
recombinant enzymes in E. coli expression libraries by fluorescence activated cell
sorting of single cells. Furthermore, there are no reports of recovering DNA encoding
bioactivities screened by expression screening in E. coli using a FACS machine. The
present invention provides these methods to allow the extremely rapid screening of
viable or non-viable cells to recover desirable activities and the nucleic acid encoding
those activities.

A limited number of papers describing various applications of flow cytometry
in the field of microbiology and fluorescence-activated sorting of microorganisms
have, however, been published (Davey and Kell, 1996). Fluorescence and other forms
of staining have been employed for microbial discrimination and identification, and in
the analysis of the interaction of drugs and antibiotics with microbial cells. Flow
cytometry has been used in aquatic biology, where autofluorescence of photosynthetic
pigments is used in the identification of algae or DNA stains are used to quantify and
count marine populations (Davey and Kell, 1996). Thus, Diaper and Edwards used
flow cytometry to detect viable bacteria after staining with a range of fluorogenic
esters including fluorescein diacetate (FDA) derivatives and ChemChrome B, a
proprietary stain sold commercially for the detection of viable bacteria in suspension
(Diaper and Edwards, 1994). Labeled antibodies and oligonucleotide probes have also
been used for these purposes.

Papers have also been published describing the application of flow cytometry
to the detection of native and recombinant enzymatic activities in eukaryotes. Betz et
al. studied native (non-recombinant) lipase production by the eukaryote Rhizopus
arrhizus with flow cytometry. They found that spore suspensions of the mold were
heterogeneous as judged by light-scattering data obtained with excitation at 633 nm,
and they sorted clones of the subpopulations into the wells of microtiter plates. After
germination and growth, lipase production was automatically assayed
(turbidimetrically) in the microtiter plates, and a representative set of the most active
were reisolated, cultured, and assayed conventionally (Betz et al., 1984).

Srienc et al. have reported a flow cytometric method for detecting cloned
β-galactosidase activity in the eukaryotic organism S. cerevisiae. The ability of flow
cytometry to make measurements on single cells means that individual cells with high
levels of expression (e.g., due to gene amplification or higher plasmid copy number)
could be detected. In the method reported, a non-fluorescent compound,
α-naphthol-β-galactopyranoside, is cleaved by β-galactosidase and the liberated naphthol is
trapped to form an insoluble fluorescent product. The insolubility of the fluorescent
product is of great importance here to prevent its diffusion from the cell. Such
diffusion would not only lead to an underestimation of β-galactosidase activity in
highly active cells but could also lead to an overestimation of enzyme activity in
inactive cells or those with low activity, as they may take up the leaked fluorescent
compound, thus reducing the apparent heterogeneity of the population.

One group has described the use of a FACS machine in an assay detecting
fusion proteins expressed from a specialized transducing bacteriophage in the
prokaryote Bacillus subtilis (Chung et al., J. of Bacteriology, Apr. 1994, p. 1977-1984;
Chung et al., Biotechnology and Bioengineering, Vol. 47, pp. 234-242 (1995)).
This group monitored the expression of a lacZ gene (which encodes β-galactosidase) fused
to the sporulation loci in B. subtilis (spo). The technique used to monitor β-galactosidase
expression from spo-lacZ fusions in single cells involved taking samples from a
sporulating culture, staining them with a commercially available fluorogenic substrate
for β-galactosidase called C8-FDG, and quantitatively analyzing fluorescence in
single cells by flow cytometry. In this study, the flow cytometer was used as a
detector to screen for the presence of the spo gene during the development of the
cells. The device was not used to screen and recover positive cells from a gene
expression library or nucleic acid for the purpose of discovery.

Another group has utilized flow cytometry to distinguish between the
developmental stages of the delta-proteobacterium Myxococcus xanthus (F. Russo-Marie
et al., PNAS, Vol. 90, pp. 8194-8198, September 1993). As in the previously
described study, this study employed the capabilities of the FACS machine to detect
and distinguish genotypically identical cells in different development regulatory
states. The screening of an enzymatic activity was used in this study as an indirect
measure of developmental changes.

The lacZ gene from E. coli is often used as a reporter gene in studies of gene
expression regulation, such as those to determine promoter efficiency, the effects of
trans-acting factors, and the effects of other regulatory elements in bacterial, yeast,
and animal cells. Using a chromogenic substrate, such as ONPG
(o-nitrophenyl-β-D-galactopyranoside), one can measure expression of β-galactosidase in cell cultures; but
it is not possible to monitor expression in individual cells and to analyze the
heterogeneity of expression in cell populations. The use of fluorogenic substrates,
however, makes it possible to determine β-galactosidase activity in a large number of
individual cells by means of flow cytometry. This type of determination can be more
informative with regard to the physiology of the cells, since gene expression can be
correlated with the stage in the mitotic cycle or the viability under certain conditions.
In 1994, Plovins et al. reported the use of fluorescein-di-β-D-galactopyranoside
(FDG) and C12-FDG as substrates for β-galactosidase detection in animal, bacterial,
and yeast cells. This study compared the two molecules as substrates for
β-galactosidase, and concluded that FDG is a better substrate for β-galactosidase
detection by flow cytometry in bacterial cells. The screening performed in this study
was for the comparison of the two substrates. The detection capabilities of a FACS
machine were employed to perform the study on viable bacterial cells.

Cells with chromogenic or fluorogenic substrates yield colored and
fluorescent products, respectively. Previously, it had been thought that the flow
cytometry-fluorescence activated cell sorter approaches could be of benefit only for
the analysis of cells that contain intracellularly, or are normally physically associated
with, the enzymatic activity or small molecule of interest. On this basis, one could
only use fluorogenic reagents which could penetrate the cell and which are thus
potentially cytotoxic. To avoid clumping of heterogeneous cells, it is desirable in
flow cytometry to analyze only individual cells, and this could limit the sensitivity
and therefore the concentration of target molecules that can be sensed. Weaver and his
colleagues at MIT and others have developed the use of gel microdroplets containing
(physically) single cells which can take up nutrients, secrete products, and grow to
form colonies. The diffusional properties of gel microdroplets may be made such that
sufficient extracellular product remains associated with each individual gel
microdroplet, so as to permit flow cytometric analysis and cell sorting on the basis of
concentration of secreted molecule within each microdroplet. Beads have also been
used to isolate mutants growing at different rates, and to analyze antibody secretion
by hybridoma cells and the nutrient sensitivity of hybridoma cells. The gel
microdroplet method has also been applied to the rapid analysis of mycobacterial
growth and its inhibition by antibiotics.

The gel microdroplet technology has had significance in amplifying the
signals available in flow cytometric analysis, and in permitting the screening of
microbial strains in strain improvement programs for biotechnology. Wittrup et al.
(Biotechnol. Bioeng. (1993) 42:351-356) developed a microencapsulation selection
method which allows the rapid and quantitative screening of >10^6 yeast cells for
enhanced secretion of Aspergillus awamori glucoamylase. The method provides a
400-fold single-pass enrichment for high-secretion mutants.

Gel microdroplet or other related technologies can be used in the present
invention to localize as well as amplify signals in the high throughput screening of
recombinant libraries. Cell viability during the screening is not an issue or concern
since nucleic acid can be recovered from the microdroplet.

Different types of encapsulation strategies and compounds or polymers can be
used with the present invention. For instance, high temperature agaroses can be
employed for making microdroplets stable at high temperatures, allowing stable
encapsulation of cells subsequent to heat kill steps utilized to remove all background
activities when screening for thermostable bioactivities.

There are several hurdles which must be overcome when attempting to detect
and sort E. coli expressing recombinant enzymes, and recover encoding nucleic acids.
FACS systems have typically been based on eukaryotic separations and have not been
refined to accurately sort single E. coli cells; the low forward and sideward scatter of
small particles like E. coli reduces sorting accuracy; and enzyme substrates
typically used in automated screening approaches, such as umbelliferyl based
substrates, diffuse out of E. coli at rates which interfere with quantitation. Further,
recovery of very small amounts of DNA from sorted organisms can be problematic.
The present invention addresses and overcomes these hurdles and offers a novel
screening approach.

There has been a dramatic increase in the need for bioactive compounds with
novel activities. This demand has arisen largely from changes in worldwide
demographics coupled with the clear and increasing trend in the number of pathogenic
organisms that are resistant to currently available antibiotics as well as the need for
new industrial processes for synthesis of compounds. For example, while there has
been a surge in demand for antibacterial drugs in emerging nations with young
populations, countries with aging populations, such as the U.S., require a growing
repertoire of drugs against cancer, diabetes, arthritis and other debilitating conditions.
The death rate from infectious diseases increased 58% between 1980 and 1992
and it has been estimated that the emergence of antibiotic resistant microbes has
added in excess of $30 billion annually to the cost of health care in the U.S. alone
(Adams et al., Chemical and Engineering News, 1995; Amann et al., Microbiological
Reviews, 59, 1995). As a response to this trend, pharmaceutical companies have
significantly increased their screening of microbial diversity for compounds with
unique activities or specificities.

The majority of bioactive compounds currently in use are derived from soil
microorganisms. Many microbes inhabiting soils and other complex ecological
communities produce a variety of compounds that increase their ability to survive and
proliferate. These compounds are generally thought to be nonessential for growth of
the organism and are synthesized with the aid of genes involved in intermediary
metabolism. Such secondary metabolites that influence the growth or survival of
other organisms are known as "bioactive" compounds and serve as key components of
the chemical defense arsenal of both micro- and macroorganisms. Humans have
exploited these compounds for use as antibiotics, antiinfectives and other bioactive
compounds with activity against a broad range of prokaryotic and eukaryotic
pathogens (Barns et al., Proc. Nat. Acad. Sci. U.S.A., 91, 1994).

The approach currently used to screen microbes for new bioactive compounds
has been largely unchanged since the inception of the field. New isolates of bacteria,
particularly gram positive strains from soil environments, are collected and their
metabolites tested for pharmacological activity.

There is still tremendous biodiversity that remains untapped as the source of
lead compounds. However, the currently available methods for screening and
producing lead compounds cannot be applied efficiently to these under-explored
resources. For instance, it is estimated that at least 99% of marine bacteria species do
not survive on laboratory media, and commercially available fermentation equipment
is not optimal for use in the conditions under which these species will grow, hence
these organisms are difficult or impossible to culture for screening or re-supply.
Recollection, growth, strain improvement, media improvement and scale-up
production of the drug-producing organisms often pose problems for synthesis and
development of lead compounds. Furthermore, the need for the interaction of specific
organisms to synthesize some compounds makes their use in discovery extremely
difficult. New methods to harness the genetic resources and chemical diversity of
these untapped sources of compounds for use in drug discovery are very valuable.

A central tenet of modern biology is that genetic information resides in a nucleic
acid genome, and that the information embodied in such a genome (i.e., the genotype)
directs cell function. This occurs through the expression of various genes in the genome
of an organism and regulation of the expression of such genes. The expression of genes
in a cell or organism defines the cell or organism's physical characteristics (i.e., its
phenotype). This is accomplished through the translation of genes into proteins.
Determining the biological activity of a protein obtained from an environmental sample
can provide valuable information about the role of proteins in the environment. In
addition, such information can help in the development of biologics, diagnostics,
therapeutics, and compositions for industrial applications.

Accordingly, the present invention provides methods and compositions to
access this untapped biodiversity and to rapidly screen for polynucleotides and small
molecules of interest utilizing high throughput screening of multiple samples.

SUMMARY OF THE INVENTION 

The present invention adapts traditional eukaryotic flow cytometry cell sorting
systems to high throughput screening for expression clones in prokaryotes. In the
present invention, nucleic acid libraries derived from DNA, for example DNA directly
isolated from the environment, are screened very rapidly for bioactivities of interest
utilizing fluorescence activated cell sorting (FACS). These libraries can contain greater
than 10^8 members and can represent single organisms or can represent the genomes of
hundreds of different organisms, species or subspecies. In one aspect, the libraries, or
clones of the libraries, are "biopanned" as a step in the high throughput analysis.

Accordingly, in one embodiment, the present invention provides a process for
identifying clones having a specified activity of interest, which process comprises (i)
generating one or more expression libraries derived from nucleic acid directly isolated
from the environment; and (ii) screening said libraries utilizing a high throughput cell
analyzer, preferably a fluorescence activated cell sorter, to identify said clones.

More particularly, the invention provides a process for identifying clones having
a specified activity of interest by (i) generating one or more libraries, e.g., expression
libraries, made to contain nucleic acid directly or indirectly isolated from the
environment; (ii) exposing said libraries to a particular substrate or substrates of interest;
and (iii) screening said exposed libraries utilizing a high throughput cell analyzer,
preferably a fluorescence activated cell sorter, to identify clones which react with the
substrate or substrates.

In another aspect, the invention also provides a process for identifying clones
having a specified activity of interest by (i) generating one or more gene libraries derived
from nucleic acid directly or indirectly isolated from the environment; and (ii) screening
said libraries utilizing an assay requiring a binding event or the covalent
modification of a target, and a high throughput cell analyzer, preferably a fluorescence
activated cell sorter, to identify positive clones.

The invention further provides a method of screening for an agent that modulates
the activity of a target protein or other cell component (e.g., nucleic acid), wherein the
target and a selectable marker are expressed by a recombinant cell, by co-encapsulating
the agent in a microenvironment with the recombinant cell expressing the target and
detectable marker and detecting the effect of the agent on the activity of the target cell
component.

In another embodiment, the invention provides a method for enriching for target
DNA sequences containing at least a partial coding region for at least one specified
activity in a DNA sample by co-encapsulating a mixture of target DNA obtained from a
mixture of organisms with a mixture of DNA probes including a detectable marker and
at least a portion of a DNA sequence encoding at least one enzyme having a specified
enzyme activity and a detectable marker; incubating the co-encapsulated mixture under
such conditions and for such time as to allow hybridization of complementary sequences;
and screening for the target DNA. Optionally, the method further comprises
transforming host cells with recovered target DNA to produce an expression library of a
plurality of clones.

The invention further provides a method of screening for an agent that modulates
the interaction of a first test protein linked to a DNA binding moiety and a second test
protein linked to a transcriptional activation moiety by co-encapsulating the agent with
the first test protein and second test protein in a suitable microenvironment and
determining the ability of the agent to modulate the interaction of the first test protein
linked to a DNA binding moiety with the second test protein covalently linked to a
transcriptional activation moiety, wherein the agent enhances or inhibits the expression
of a detectable protein. Preferably, screening is by FACS analysis.

In yet another aspect, the present invention provides a method for identifying a
polynucleotide of interest or a molecule having an activity of interest, including
contacting a library containing a plurality of clones which include nucleic acid derived
from more than one organism or source, e.g., a mixed population of organisms, including
microorganisms or plant tissue, with at least one oligonucleotide probe labeled with a
detectable, e.g., fluorescent, molecule. The detectable molecule changes, e.g.,
fluoresces, upon interaction of the probe with a target polynucleotide in the library. Clones
from the library are then separated with an analyzer that detects the change in the
detectable molecule, e.g., fluorescence. The separated clones can be contacted with a
reporter system that identifies a polynucleotide encoding a polypeptide or a small
molecule of interest, for example, and the clones capable of modulating expression or
activity of the reporter system are identified, thereby identifying a polynucleotide of interest.

In another embodiment, the invention provides a method for identifying a
polynucleotide encoding a polypeptide of interest. The method includes
co-encapsulating in a microenvironment a plurality of library clones containing DNA
obtained from a mixed population of organisms with a mixture of oligonucleotide probes
comprising a fluorescence marker and at least a portion of a polynucleotide sequence
encoding a polypeptide of interest having a specified bioactivity. The encapsulated
clones are incubated under such conditions and for such time as to allow interaction of
complementary sequences, and clones containing a complement to the oligonucleotide
probe encoding the polypeptide of interest are identified by separating clones with a
fluorescent analyzer that detects fluorescence.

In yet another embodiment, the invention provides a method for high throughput
screening of a polynucleotide library for a polynucleotide of interest that encodes a
molecule of interest. The method includes contacting a library containing a plurality of
clones comprising polynucleotides derived from a mixed population of organisms with a
plurality of oligonucleotide probes labeled with a fluorescence molecule, wherein said
fluorescence molecule fluoresces upon interaction of the probe with a target polynucleotide
in the library; separating clones with a fluorescent analyzer that detects fluorescence;
contacting the separated clones with a reporter system that identifies a polynucleotide
encoding the molecule of interest; and identifying clones capable of modulating
expression or activity of the reporter system, thereby identifying a polynucleotide of
interest.

In another embodiment, the invention provides a method of screening for a
polynucleotide encoding an activity of interest. The method includes (a) obtaining
polynucleotides from an environmental sample; (b) normalizing the polynucleotides
obtained from the sample; (c) generating a library from the normalized polynucleotides;
(d) contacting the library with a plurality of oligonucleotide probes comprising a
fluorescent marker and at least a portion of a polynucleotide sequence encoding a
polypeptide of interest having a specified activity to select library clones positive for a
sequence of interest; (e) selecting clones with a fluorescent analyzer that detects
fluorescence; (f) contacting the selected clones with a reporter system that identifies a
polynucleotide encoding the activity of interest; and (g) identifying clones capable of
modulating expression or activity of the reporter system, thereby identifying a
polynucleotide of interest; wherein the positive clones contain a polynucleotide sequence
encoding an activity of interest which is capable of catalyzing the bioactive substrate.

In yet another embodiment, the present invention provides a method for
screening polynucleotides, comprising contacting a library of polynucleotides derived
from a mixed population of organisms with a probe oligonucleotide labeled with a
fluorescence molecule, which fluoresces upon binding of the probe to a target
polynucleotide of the library, to select library polynucleotides positive for a sequence of
interest; separating library members that are positive for the sequence of interest with a
fluorescent analyzer that detects fluorescence; expressing the selected polynucleotides to
obtain polypeptides; contacting the polypeptides with a reporter system; and identifying
polynucleotides encoding polypeptides capable of modulating expression or activity of
the reporter system.

In another embodiment, the invention provides a method for obtaining an
organism from a mixed population of organisms in a sample. The method includes
encapsulating in a microenvironment at least one organism from the sample; incubating
the encapsulated organism under such conditions and for such a time to allow the at
least one microorganism to grow or proliferate; and sorting the encapsulated organism
by flow cytometry to obtain an organism from the sample.

BRIEF DESCRIPTION OF THE FIGURES



Figure 1 illustrates the protocol used in the cell sorting method of the invention
to screen for a polynucleotide of interest, in this case using a library excised into E.
coli. The clones of interest are isolated by sorting.

Figure 2 shows a microtiter plate where clones or cells are sorted in accordance
with the invention. Typically one cell or cells grown within a microdroplet are dispersed
per well and grown up as clones.

Figure 3 depicts a co-encapsulation assay. Cells containing library clones are
co-encapsulated with a substrate or labeled oligonucleotide. Encapsulation can occur by a
variety of means, including GMDs, liposomes, and ghost cells. Cells are screened via
high throughput screening on a fluorescence analyzer.

Figure 4 depicts a side scatter versus forward scatter graph of FACS sorted
gel-microdroplets (GMDs) containing a species of Streptomyces which forms unicells.
Empty gel-microdroplets are also distinguished from free cells and debris.

Figure 5 is a depiction of a FACS/Biopanning method described herein and
described in Example 3, below.

DETAILED DESCRIPTION OF THE INVENTION 

The present invention provides a method for rapid sorting and screening of
libraries derived from a mixed population of organisms from, for example, an
environmental sample or an uncultivated population of organisms. In one embodiment,
gene libraries are generated, clones are either exposed to a substrate or substrates of
interest or hybridized to a fluorescence-labeled probe having a sequence corresponding
to a sequence of interest, and positive clones are identified and isolated via fluorescence
activated cell sorting. Cells can be viable or non-viable during the process or at the end
of the process, as nucleic acids encoding a positive activity can be isolated and cloned
utilizing techniques well known in the art.
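
The identification of positive clones by fluorescence can be pictured with a minimal sketch. The following Python fragment is an illustration only and is not part of the disclosed apparatus: the `Event` record, the `sort_positive` function, the clone names, the signal values and the gate threshold are all hypothetical, chosen simply to show a single-threshold gate applied to per-clone fluorescence readings.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One detected particle (hypothetical record, for illustration only)."""
    clone_id: str
    fluorescence: float  # arbitrary units


def sort_positive(events, threshold):
    """Return the events whose fluorescence is at or above the gate threshold."""
    return [e for e in events if e.fluorescence >= threshold]


# Hypothetical readings: two clones fluoresce strongly, two do not.
events = [
    Event("c1", 12.0),
    Event("c2", 850.0),
    Event("c3", 30.5),
    Event("c4", 1200.0),
]
positives = sort_positive(events, threshold=500.0)
print([e.clone_id for e in positives])
```

In practice a sorter would gate on several parameters at once (for example scatter as well as fluorescence); the single threshold here is a deliberate simplification.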

This invention differs from fluorescence activated cell sorting, as normally
performed, in several aspects. Previously, FACS machines have been employed in
studies focused on the analyses of eukaryotic and prokaryotic cell lines and cell culture
processes. FACS has also been utilized to monitor production of foreign proteins in both
eukaryotes and prokaryotes to study, for example, differential gene expression. The
detection and counting capabilities of the FACS system have been applied in these
examples. However, FACS has never previously been employed in a discovery process
to screen for and recover bioactivities in prokaryotes. Furthermore, the present
invention does not require cells to survive, as do previously described technologies,
since the desired nucleic acid (recombinant clones) can be obtained from alive or dead
cells. For example, the cells only need to be viable long enough to contain, carry or
synthesize a complementary nucleic acid sequence to be detected, and can thereafter be
either viable or non-viable cells so long as the complementary sequence remains intact.
The present invention also solves problems that would have been associated with
detection and sorting of E. coli expressing recombinant enzymes, and recovering
encoding nucleic acids. Additionally, the present invention includes within its
embodiments any apparatus capable of detecting fluorescent wavelengths associated
with biological material; such apparatuses are defined herein as fluorescent analyzers
(one example of which is a FACS apparatus).

The use of a culture-independent approach to directly clone genes encoding
novel enzymes from, for example, an environmental sample allows one to access
untapped resources of biodiversity. In one embodiment, the invention is based on the
construction of "environmental libraries" which represent the collective genomes of
naturally occurring organisms archived in cloning vectors that can be propagated in
suitable prokaryotic hosts. Because the cloned DNA is initially extracted directly from
environmental samples, the libraries are not limited to the small fraction of prokaryotes
that can be grown in pure culture. Additionally, a normalization of the environmental
DNA present in these samples could allow more equal representation of the DNA from
all of the species present in the original sample. This can dramatically increase the
efficiency of finding interesting genes from minor constituents of the sample which may
be under-represented by several orders of magnitude compared to the dominant species.
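
The effect of normalization on screening effort can be illustrated with a small calculation. The sketch below is a toy model and not a description of any normalization procedure: it assumes a hypothetical sample with one dominant species and three minor species that are 1000-fold rarer, treats each library clone as an independent draw, and idealizes normalization as giving every species an equal share of the library. The species names, abundances and the 95% confidence level are assumptions made only for the example.

```python
import math


def clones_needed(fraction, confidence=0.95):
    """Clones to sample so that, with the given confidence, at least one
    comes from a source making up `fraction` of the library."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


# Hypothetical relative abundances of DNA in the original sample.
abundance = {"dominant": 1000.0, "minor_a": 1.0, "minor_b": 1.0, "minor_c": 1.0}

total = sum(abundance.values())
raw_fraction = abundance["minor_a"] / total   # share before normalization
normalized_fraction = 1.0 / len(abundance)    # idealized equal share afterwards

raw_clones = clones_needed(raw_fraction)
normalized_clones = clones_needed(normalized_fraction)
print(raw_clones, normalized_clones)
```

Under these assumptions the number of clones that must be examined to encounter the minor species drops by more than two orders of magnitude, which is the sense in which normalization increases the efficiency of finding genes from minor constituents.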

Prior to the present invention, the evaluation of complex environmental
expression libraries was rate limiting. The present invention allows the rapid screening
of complex environmental libraries, containing, for example, genes from thousands of
different organisms. The benefits of the present invention can be seen, for example, in
screening a complex environmental sample. Screening of a complex sample previously
required one to use labor intensive methods to screen several million clones to cover the
genomic biodiversity. The invention represents an extremely high-throughput screening
method which allows one to assess this enormous number of clones. The method
disclosed herein allows the screening of anywhere from about 30 million to about 200
million clones per hour for a desired nucleic acid sequence or biological activity. This
allows the thorough screening of environmental libraries for clones expressing novel
biomolecules.
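
As a worked example of the stated rates, the following Python sketch estimates how long a single pass through a library would take. It is simple arithmetic, not a measured result: the library size of 10^8 clones is an assumption taken from the order of magnitude mentioned in the Summary, and the function and variable names are hypothetical.

```python
def screening_hours(library_size, clones_per_hour):
    """Hours needed to pass every clone in a library through the analyzer once."""
    return library_size / clones_per_hour


LIBRARY_SIZE = 10**8        # assumed library size (order of magnitude from the Summary)
LOW_RATE = 30_000_000       # about 30 million clones per hour
HIGH_RATE = 200_000_000     # about 200 million clones per hour

slowest = screening_hours(LIBRARY_SIZE, LOW_RATE)
fastest = screening_hours(LIBRARY_SIZE, HIGH_RATE)
print(f"{slowest:.2f} h at the low rate, {fastest:.2f} h at the high rate")
```

On these assumptions a library of this size is covered once in roughly half an hour to a little over three hours, which is what makes thorough screening of such libraries practical.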

The invention provides methods and compositions whereby one can screen, sort
or identify a polynucleotide sequence, polypeptide, or molecule of interest from a mixed
population of organisms (e.g., organisms present in an environmental sample) based on
polynucleotide sequences present in the sample. Thus, the invention provides methods
and compositions useful in screening organisms for a desired biological activity or
biological sequence and to assist in obtaining sequences of interest that can further be
used in directed evolution, molecular biology, biotechnology and industrial applications.
By screening and identifying the nucleic acid sequences present in the sample, the
invention increases the repertoire of available sequences that can be used for the
development of diagnostics, therapeutics or molecules for industrial applications.
Accordingly, the methods of the invention can identify novel nucleic acid sequences
encoding proteins or polypeptides having a desired biological activity.

Flow cytometry has been used in cloning and selection of variants from existing
cell clones. This selection, however, has required stains that diffuse through cells
passively, rapidly and irreversibly, with no toxic effects or other influences on metabolic
or physiological processes. Since, typically, flow sorting has been used to study animal
cell culture performance, physiological state of cells, and the cell cycle, one goal of cell
sorting has been to keep the cells viable during and after sorting.

There currently are no reports in the literature of screening and discovery of
polynucleotide sequences in libraries by fluorescence activated cell sorting. Furthermore,
there are no reports of recovering DNA encoding bioactivities screened by FACS and
additionally screening for a bioactivity of interest. The present invention provides these
methods to allow the extremely rapid screening of viable or non-viable cells to recover
desirable activities and the nucleic acid encoding those activities.

Fluorescence and other forms of staining have been employed for microbial
discrimination and identification, and in the analysis of the interaction of drugs and
antibiotics with microbial cells. Flow cytometry has been used in aquatic biology, where
autofluorescence of photosynthetic pigments is used in the identification of algae or
DNA stains are used to quantify and count marine populations (Davey and Kell, 1996).
Diaper and Edwards used flow cytometry to detect viable bacteria after staining with a
range of fluorogenic esters including fluorescein diacetate (FDA) derivatives and
ChemChrome B, a stain sold commercially for the detection of viable bacteria in
suspension (Diaper and Edwards, 1994). Labeled antibodies and oligonucleotide probes
can also be used for these purposes.

Papers have been published describing the application of flow cytometry to the
detection of native and recombinant enzymatic activities in eukaryotes. Betz et al.
studied native (non-recombinant) lipase production by the eukaryote Rhizopus arrhizus
with flow cytometry. They found that spore suspensions of the mold were
heterogeneous as judged by light-scattering data obtained with excitation at 633 nm, and
they sorted clones of the subpopulations into the wells of microtiter plates. After
germination and growth, lipase production was automatically assayed (turbidimetrically)
in the microtiter plates, and a representative set of the most active were reisolated,
cultured, and assayed conventionally (Betz et al., 1984). The ability of flow cytometry
to make measurements on single cells means that individual cells with high levels of
expression (e.g., due to gene amplification or higher plasmid copy number) could be
detected.

Cells with chromogenic or fluorogenic substrates yield colored and fluorescent
products, respectively. Previously, it had been thought that the flow
cytometry-fluorescence activated cell sorter approaches could be of benefit only for the
analysis of cells that contain intracellularly, or are normally physically associated with, the
enzymatic activity of a molecule of interest. On this basis, one could only use
fluorogenic reagents which could penetrate the cell and which are thus potentially
cytotoxic. In addition, gel microdroplets (GMDs) can be used during FACS sorting and
culturing. The use of GMDs containing (physically) single cells which can take up
nutrients, secrete products, and grow to form colonies is useful in the present invention.
The diffusional properties of GMDs may be made such that sufficient extracellular
product remains associated with each individual GMD, so as to permit flow cytometric
analysis and cell sorting on the basis of concentration of secreted molecule within each
microdroplet. Beads have also been used to isolate mutants growing at different rates,
and to analyze antibody secretion by hybridoma cells and the nutrient sensitivity of
hybridoma cells.

The GMD technology has had significance in amplifying the signals available in
flow cytometric analysis, and in permitting the screening and sorting of microbial strains
in strain improvement and isolation programs. GMD or other related technologies can
be used in the present invention to localize and sort, as well as amplify, signals in the high
throughput screening of recombinant libraries. Cell viability during the screening is not
an issue or concern since nucleic acid can be recovered from the microdroplet.

Different types of encapsulation strategies and compounds or polymers can be used
with the present invention. For instance, high temperature agaroses can be employed for
making microdroplets stable at high temperatures, allowing stable encapsulation of cells
subsequent to heat-kill steps utilized to remove all background activities when screening
for thermostable bioactivities. Encapsulation can be in beads, high temperature
agaroses, gel microdroplets, cells, such as ghost red blood cells or macrophages,
liposomes, or any other means of encapsulating and localizing molecules.

For example, methods of preparing liposomes have been described (e.g., U.S.
Patent Nos. 5,653,996, 5,393,530 and 5,651,981), as well as the use of liposomes to
encapsulate a variety of molecules (U.S. Patent Nos. 5,595,756, 5,605,703, 5,627,159,
5,652,225, 5,567,433, 4,235,871, 5,227,170). Entrapment of proteins, viruses,
bacteria and DNA in erythrocytes during endocytosis has been described, as well
(Journal of Applied Biochemistry 4, 418-435 (1982)). Erythrocytes employed as
carriers in vitro or in vivo for substances entrapped during hypo-osmotic lysis or
dielectric breakdown of the membrane have also been described (reviewed in Ihler, G.
M. (1983) J. Pharm. Ther). These techniques are useful in the present invention to
encapsulate samples for screening.

"Microenvironment", as used herein, is any molecular structure which provides an
appropriate environment for facilitating the interactions necessary for the method of
the invention. An environment suitable for facilitating molecular interactions includes,
for example, liposomes. Liposomes can be prepared from a variety of lipids including
phospholipids, glycolipids, steroids, long-chain alkyl esters (e.g., alkyl phosphates),
fatty acid esters (e.g., lecithin), fatty amines and the like. A mixture of fatty material
may be employed, such as a combination of a neutral steroid, a charged amphiphile and a
phospholipid. Illustrative examples of phospholipids include lecithin, sphingomyelin
and dipalmitoylphosphatidylcholine. Representative steroids include cholesterol,
cholestanol and lanosterol. Representative charged amphiphilic compounds generally
contain from 12 to 30 carbon atoms and include mono- or dialkyl phosphate esters and
alkyl amines, e.g., dicetyl phosphate, stearyl amine, hexadecyl amine, dilauryl phosphate,
and the like.

As used herein and in the appended claims, the singular forms "a," "an," and
"the" include plural referents unless the context clearly dictates otherwise. Thus, for
example, reference to "a clone" includes a plurality of clones and reference to "the
nucleic acid sequence" generally includes reference to one or more nucleic acid
sequences and equivalents thereof known to those skilled in the art, and so forth.

Unless defined otherwise, all technical and scientific terms used herein have the
same meaning as commonly understood by one of ordinary skill in the art to which the
invention belongs. Although any methods, devices and materials similar or equivalent to
those described herein can be used in the practice or testing of the invention, the
preferred methods, devices and materials are now described.

All publications mentioned herein are incorporated herein by reference in full for
the purpose of describing and disclosing the databases, proteins, and methodologies,
which are described in the publications which might be used in connection with the
presently described invention. The publications discussed above and throughout the text
are provided solely for their disclosure prior to the filing date of the present application.
Nothing herein is to be construed as an admission that the inventors are not entitled to
antedate such disclosure by virtue of prior invention.

An "amino acid" is a molecule having the structure wherein a central carbon
atom (the α-carbon atom) is linked to a hydrogen atom, a carboxylic acid group (the
carbon atom of which is referred to herein as a "carboxyl carbon atom"), an amino group
(the nitrogen atom of which is referred to herein as an "amino nitrogen atom"), and a
side chain group, R. When incorporated into a peptide, polypeptide, or protein, an
amino acid loses one or more atoms of its amino and carboxylic groups in the
dehydration reaction that links one amino acid to another. As a result, when
incorporated into a protein, an amino acid is referred to as an "amino acid residue."

"Protein" or "polypeptide" refers to any polymer of two or more individual
amino acids (whether or not naturally occurring) linked via a peptide bond, which occurs
when the carboxyl carbon atom of the carboxylic acid group bonded to the α-carbon of
one amino acid (or amino acid residue) becomes covalently bound to the amino nitrogen
atom of the amino group bonded to the α-carbon of an adjacent amino acid. The term
"protein" is understood to include the terms "polypeptide" and "peptide" (which, at
times, may be used interchangeably herein) within its meaning. In addition, proteins
comprising multiple polypeptide subunits (e.g., DNA polymerase III, RNA polymerase
II) or other components (for example, an RNA molecule, as occurs in telomerase) will
also be understood to be included within the meaning of "protein" as used herein.
Similarly, fragments of proteins and polypeptides are also within the scope of the
invention and may be referred to herein as "proteins."

A particular amino acid sequence of a given protein (i.e., the polypeptide's
"primary structure," when written from the amino-terminus to carboxy-terminus) is
determined by the nucleotide sequence of the coding portion of a mRNA, which is in
turn specified by genetic information, typically genomic DNA (including organelle
DNA, e.g., mitochondrial or chloroplast DNA). Thus, determining the sequence of a
gene assists in predicting the primary sequence of a corresponding polypeptide and,
more particularly, the role or activity of the polypeptide or proteins encoded by that gene or
polynucleotide sequence.

The term "isolated" means altered "by the hand of man" from its natural state;
i.e., if it occurs in nature, it has been changed or removed from its original environment,
or both. For example, a naturally occurring polynucleotide or a polypeptide naturally
present in a living animal, a biological sample or an environmental sample in its natural
state is not "isolated", but the same polynucleotide or polypeptide separated from the
coexisting materials of its natural state is "isolated", as the term is employed herein.
Such polynucleotides, when introduced into host cells in culture or in whole organisms,
still would be isolated, as the term is used herein, because they would not be in their
naturally occurring form or environment. Similarly, the polynucleotides and
polypeptides may occur in a composition, such as a media formulation (solutions for
introduction of polynucleotides or polypeptides, for example, into cells or compositions
or solutions for chemical or enzymatic reactions).

"Polynucleotide" or "nucleic acid sequence" refers to a polymeric form of
nucleotides. In some instances a polynucleotide refers to a sequence that is not
immediately contiguous with either of the coding sequences with which it is
immediately contiguous (one on the 5' end and one on the 3' end) in the naturally
occurring genome of the organism from which it is derived. The term therefore includes,
for example, a recombinant DNA which is incorporated into a vector; into an
autonomously replicating plasmid or virus; or into the genomic DNA of a prokaryote or
eukaryote, or which exists as a separate molecule (e.g., a cDNA) independent of other
sequences. The nucleotides of the invention can be ribonucleotides,
deoxyribonucleotides, or modified forms of either nucleotide. A polynucleotide as used
herein refers to, among others, single- and double-stranded DNA, DNA that is a mixture
of single- and double-stranded regions, single- and double-stranded RNA, and RNA that
is a mixture of single- and double-stranded regions, hybrid molecules comprising DNA
and RNA that may be single-stranded or, more typically, double-stranded or a mixture of
single- and double-stranded regions.

In addition, polynucleotide as used herein refers to triple-stranded regions
comprising RNA or DNA or both RNA and DNA. The strands in such regions may be
from the same molecule or from different molecules. The regions may include all of one
or more of the molecules, but more typically involve only a region of some of the
molecules. One of the molecules of a triple-helical region often is an oligonucleotide.
The term polynucleotide encompasses genomic DNA or RNA (depending upon the
organism, i.e., RNA genome of viruses), as well as mRNA encoded by the genomic
DNA, and cDNA.

As mentioned above, there is currently a need in the biotechnology and chemical
industry for molecules that can optimally carry out biological or chemical processes
(e.g., enzymes). Identifying novel enzymes in an environmental sample is one solution
to this problem. By rapidly identifying polypeptides having an activity of interest and
polynucleotides encoding the polypeptide of interest, the invention provides methods,
compositions and sources for the development of biologics, diagnostics, therapeutics,
and compositions for industrial applications.

All classes of molecules and compounds that are utilized in both established and
emerging chemical, pharmaceutical, textile, food and feed, and detergent markets must
meet stringent economical and environmental standards. The synthesis of polymers,
pharmaceuticals, natural products and agrochemicals is often hampered by expensive
processes which produce harmful byproducts and which suffer from poor or inefficient
catalysis. Enzymes, for example, have a number of remarkable advantages which can
overcome these problems in catalysis: they act on single functional groups, they
distinguish between similar functional groups on a single molecule, and they distinguish
between enantiomers. Moreover, they are biodegradable and function at very low mole
fractions in reaction mixtures. Because of their chemo-, regio- and stereospecificity,
enzymes present a unique opportunity to optimally achieve desired selective
transformations. These are often extremely difficult to duplicate chemically, especially
in single-step reactions. The elimination of the need for protecting groups, selectivity,
the ability to carry out multi-step transformations in a single reaction vessel, along with
the concomitant reduction in environmental burden, have led to the increased demand for
enzymes in chemical and pharmaceutical industries. Enzyme-based processes have been
gradually replacing many conventional chemical-based methods. A current limitation to
more widespread industrial use is primarily due to the relatively small number of
commercially available enzymes. Only ~300 enzymes (excluding DNA modifying
enzymes) are at present commercially available from the >3000 non-DNA-modifying
enzyme activities thus far described.

The use of enzymes for technological applications also may require performance
under demanding industrial conditions. This includes activities in environments or on
substrates for which the currently known arsenal of enzymes was not evolutionarily
selected. However, the natural environment provides extreme conditions including, for
example, extremes in temperature and pH. A number of organisms have adapted to
these conditions due in part to selection for polypeptides that can withstand these
extremes.

Enzymes have evolved by selective pressure to perform very specific biological
functions within the milieu of a living organism, under conditions of temperature, pH
and salt concentration. For the most part, the non-DNA modifying enzyme activities
thus far described have been isolated from mesophilic organisms, which represent a very
small fraction of the available phylogenetic diversity. The dynamic field of biocatalysis
takes on a new dimension with the help of enzymes isolated from microorganisms that
thrive in extreme environments. For example, such enzymes must function at
temperatures above 100°C in terrestrial hot springs and deep sea thermal vents, at
temperatures below 0°C in arctic waters, in the saturated salt environment of the Dead
Sea, at pH values around 0 in coal deposits and geothermal sulfur-rich springs, or at pH
values greater than 11 in sewage sludge. Environmental samples obtained, for example,
from extreme conditions containing organisms, polynucleotides or polypeptides (e.g.,
enzymes) open a new field in biocatalysis. By rapidly screening for polynucleotides
encoding polypeptides of interest, the invention provides not only a source of materials
for the development of biologics, therapeutics, and enzymes for industrial applications,
but also new materials for further processing by, for example, directed
evolution and mutagenesis to develop molecules or polypeptides modified for particular
activity or conditions.

In addition to the need for new enzymes for industrial use, there has been a
dramatic increase in the need for bioactive compounds with novel activities. This
demand has arisen largely from changes in worldwide demographics coupled with the
clear and increasing trend in the number of pathogenic organisms that are resistant to
currently available antibiotics. For example, while there has been a surge in demand for
antibacterial drugs in emerging nations with young populations, countries with aging
populations, such as the U.S., require a growing repertoire of drugs against cancer,
diabetes, arthritis and other debilitating conditions. The death rate from infectious
diseases increased 58% between 1980 and 1992, and it has been estimated that the
emergence of antibiotic resistant microbes has added in excess of $30 billion annually to
the cost of health care in the U.S. alone (Adams et al., Chemical and Engineering
News, 1995; Amann et al., Microbiological Reviews, 59, 1995). As a response to this
trend, pharmaceutical companies have significantly increased their screening of microbial
diversity for compounds with unique activities or specificity. Accordingly, the invention
can be used to obtain and identify polynucleotides and related sequence specific
information from, for example, infectious microorganisms present in the environment
such as, for example, in the gut of various macroorganisms.

In another embodiment, the methods and compositions of the invention provide
for the identification of lead drug compounds present in an environmental sample. The
methods of the invention provide the ability to mine the environment for novel drugs or
identify related drugs contained in different microorganisms. There are several common
sources of lead compounds (drug candidates), including natural product collections,
synthetic chemical collections, and synthetic combinatorial chemical libraries, such as
nucleotides, peptides, or other polymeric molecules that have been identified or
developed as a result of environmental mining. Each of these sources has advantages
and disadvantages. The success of programs to screen these candidates depends largely
on the number of compounds entering the programs, and pharmaceutical companies
have to date screened hundreds of thousands of synthetic and natural compounds in
search of lead compounds. Unfortunately, the ratio of novel to previously-discovered
compounds has diminished with time. The discovery rate of novel lead compounds has
not kept pace with demand despite the best efforts of pharmaceutical companies. There
exists a strong need for accessing new sources of potential drug candidates.
Accordingly, the invention provides a rapid and efficient method to identify and
characterize environmental samples that may contain novel drug compounds.

The majority of bioactive compounds currently in use are derived from soil
microorganisms. Many microbes inhabiting soils and other complex ecological
communities produce a variety of compounds that increase their ability to survive and
proliferate. These compounds are generally thought to be nonessential for growth of the
organism and are synthesized with the aid of genes involved in intermediary metabolism,
hence their name, "secondary metabolites". Secondary metabolites that influence the
growth or survival of other organisms are known as "bioactive" compounds and serve as
key components of the chemical defense arsenal of both micro- and macro-organisms.
Humans have exploited these compounds for use as antibiotics, antiinfectives and other
bioactive compounds with activity against a broad range of prokaryotic and eukaryotic
pathogens. Approximately 6,000 bioactive compounds of microbial origin have been
characterized, with more than 60% produced by the gram positive soil bacteria of the
genus Streptomyces (Barnes et al., Proc. Nat. Acad. Sci. U.S.A., 91, 1994). Of these, at
least 70 are currently used for biomedical and agricultural applications. The largest class
of bioactive compounds, the polyketides, includes a broad range of antibiotics,
immunosuppressants and anticancer agents which together account for sales of over $5
billion per year.

Despite the seemingly large number of available bioactive compounds, it is clear
that one of the greatest challenges facing modern biomedical science is the proliferation
of antibiotic resistant pathogens. Because of their short generation time and ability to
readily exchange genetic information, pathogenic microbes have rapidly evolved and
disseminated resistance mechanisms against virtually all classes of antibiotic
compounds. For example, there are virulent strains of the human pathogens
Staphylococcus and Streptococcus that can now be treated with but a single antibiotic,
vancomycin, and resistance to this compound will require only the transfer of a single
gene, vanA, from resistant Enterococcus species (Bateson et al.,
System. Appl. Microbiol., 12, 1989). When this crucial need for novel antibacterial
compounds is superimposed on the growing demand for enzyme inhibitors,
immunosuppressants and anti-cancer agents, it becomes readily apparent why
pharmaceutical companies have stepped up their screening of microbial samples for
bioactive compounds.

The invention provides methods of identifying a nucleic acid sequence encoding
a polypeptide having either known or unknown function. For example, much of the
diversity in microbial genomes results from the rearrangement of gene clusters in the
genome of microorganisms. These gene clusters can be present across species or
phylogenetically related with other organisms.

For example, bacteria and many eukaryotes have a coordinated mechanism for
regulating genes whose products are involved in related processes. The genes are
clustered, in structures referred to as "gene clusters," on a single chromosome and are
transcribed together under the control of a single regulatory sequence, including a single
promoter which initiates transcription of the entire cluster. The gene cluster, the
promoter, and additional sequences that function in regulation altogether are referred to
as an "operon" and can include up to 20 or more genes, usually from 2 to 6 genes. Thus,
a gene cluster is a group of adjacent genes that are either identical or related, usually as
to their function.

Some gene families consist of identical members. Clustering is a prerequisite for
maintaining identity between genes, although clustered genes are not necessarily
identical. Gene clusters range from extremes where a duplication generates
adjacent related genes to cases where hundreds of identical genes lie in a tandem array.
Sometimes no significance is discernible in a repetition of a particular gene. A principal
example of this is the expressed duplicate insulin genes in some species, whereas a
single insulin gene is adequate in other mammalian species.

Further, gene clusters undergo continual reorganization and, thus, the ability to
create heterogeneous libraries of gene clusters from, for example, bacterial or other
prokaryote sources is valuable in determining sources of novel proteins, particularly
including enzymes such as, for example, the polyketide synthases that are responsible for
the synthesis of polyketides having a vast array of useful activities. Other types of
proteins that are the product(s) of gene clusters are also contemplated, including, for
example, antibiotics, antivirals, antitumor agents and regulatory proteins, such as insulin.

As an example, polyketide synthase enzymes fall in a gene cluster. Polyketides
are molecules which are an extremely rich source of bioactivities, including antibiotics
(such as tetracyclines and erythromycin), anti-cancer agents (daunomycin),
immunosuppressants (FK506 and rapamycin), and veterinary products (monensin).
Many polyketides (produced by polyketide synthases) are valuable as therapeutic agents.
Polyketide synthases are multifunctional enzymes that catalyze the biosynthesis of a
huge variety of carbon chains differing in length and patterns of functionality and
cyclization. Polyketide synthase genes fall into gene clusters and at least one type
(designated type I) of polyketide synthase has large size genes and enzymes,
complicating genetic manipulation and in vitro studies of these genes/proteins.

The ability to select and combine desired components from a library of
polyketides and postpolyketide biosynthesis genes for generation of novel polyketides
for study is appealing. The method(s) of the present invention make possible, and
facilitate, the cloning of novel polyketide synthases, since one can generate gene banks
with clones containing large inserts (especially when using the f-factor based vectors),
which facilitates cloning of gene clusters.

For example, a gene cluster can be ligated into a vector containing an expression
regulatory sequence which can control and regulate the production of a detectable
protein or protein-related array activity from the ligated gene clusters. Use of vectors
which have an exceptionally large capacity for exogenous nucleic acid introduction is
particularly appropriate for use with such gene clusters, and such vectors are described
by way of example herein to include the f-factor (or fertility factor) of E. coli. This
f-factor of E. coli is a plasmid which effects high-frequency transfer of itself during
conjugation and is ideal to achieve and stably propagate large nucleic acid fragments,
such as gene clusters from mixed microbial samples.

The nucleic acid isolated or derived from these samples (e.g., a mixed population
of microorganisms) can preferably be inserted into a vector or a plasmid prior to
screening of the polynucleotides. Such vectors or plasmids are typically those
containing expression regulatory sequences, including promoters, enhancers and the like.
Accordingly, the invention provides novel systems to clone and screen mixed
populations of organisms present, for example, in an environmental sample, for
polynucleotides of interest, enzymatic activities and bioactivities of interest in vitro. The
method(s) of the invention allow the cloning and discovery of novel bioactive molecules
in vitro, and in particular novel bioactive molecules derived from uncultivated or
cultivated samples. Large size gene clusters, genes and gene fragments can be cloned,
sequenced and screened using the method(s) of the invention. Unlike previous
strategies, the method(s) of the invention allow one to clone, screen and identify
polynucleotides and the polypeptides encoded by these polynucleotides in vitro from a
wide range of environmental samples.

The invention allows one to screen for and identify polynucleotide sequences
from complex environmental samples. DNA libraries obtained from these samples can
be created from cell free samples, so long as the sample contains nucleic acid sequences,
or from samples containing cellular organisms or viral particles. The organisms from
which the libraries may be prepared include prokaryotic microorganisms, such as
Eubacteria and Archaebacteria, lower eukaryotic microorganisms such as fungi, algae
and protozoa, as well as mixed populations of plants, plant spores and pollen. The
organisms may be cultured organisms or uncultured organisms obtained from
environmental samples and include extremophiles, such as thermophiles,
hyperthermophiles, psychrophiles and psychrotrophs.

Sources of nucleic acids used to construct a DNA library can be obtained from
environmental samples, such as, but not limited to, microbial samples obtained from
Arctic and Antarctic ice, water or permafrost sources, materials of volcanic origin,
materials from soil or plant sources in tropical areas, droppings from various organisms
including mammals and invertebrates, as well as dead and decaying matter, etc. Thus, for
example, nucleic acids may be recovered from either a cultured or non-cultured
organism and used to produce an appropriate DNA library (e.g., a recombinant
expression library) for subsequent determination of the identity of the particular
polynucleotide sequence or screening for enzyme activity.

The following outlines a general procedure for producing libraries from both
culturable and non-culturable organisms as well as mixed populations of organisms,
which libraries can be probed, sequenced or screened to select therefrom nucleic acid
sequences having an identified, desired or predicted biological activity (e.g., an
enzymatic activity).

As used herein an environmental sample is any sample containing organisms or
polynucleotides or a combination thereof. Thus, an environmental sample can be
obtained from any number of sources (as described above), including, for example,
insect feces, soil, water, etc. Any source of nucleic acids in purified or non-purified
form can be utilized as starting material. Thus, the nucleic acids may be obtained from
any source which is contaminated by an organism or from any sample containing cells.
The environmental sample can be an extract from any bodily sample such as blood,
urine, spinal fluid, tissue, vaginal swab, stool, amniotic fluid or buccal mouthwash from
any mammalian organism. For non-mammalian (e.g., invertebrate) organisms the
sample can be a tissue sample, salivary sample, fecal material or material in the digestive
tract of the organism. An environmental sample also includes samples obtained from
extreme environments including, for example, hot sulfur pools, volcanic vents, and
frozen tundra. In addition, the sample can come from a variety of sources. For example,
in horticulture and agricultural testing the sample can be a plant, fertilizer, soil, liquid or
other horticultural or agricultural product; in food testing the sample can be fresh food or
processed food (for example infant formula, seafood, fresh produce and packaged food);
and in environmental testing the sample can be liquid, soil, sewage treatment, sludge and
any other sample in the environment which is considered or suspected of containing an
organism or polynucleotides.

When the sample is a mixture of material (e.g., a mixed population of
organisms), for example, blood, soil and sludge, it can be treated with an appropriate
reagent which is effective to open the cells and expose or separate the strands of nucleic
acids. Although not necessary, this lysing and nucleic acid denaturing step will allow
cloning, amplification or sequencing to occur more readily. Further, if desired, the mixed
population can be cultured prior to analysis in order to purify a particular population and
thus obtain a purer sample. However, this is not necessary. For example, culturing of
organisms in the sample can include culturing the organisms in microdroplets and
separating the cultured microdroplets with a cell sorter into individual wells of a
multi-well tissue culture plate from which further processing may be performed.

Accordingly, the sample comprises nucleic acids from, for example, a diverse
and mixed population of organisms (e.g., microorganisms present in the gut of an
insect). Nucleic acids are isolated from the sample using any number of methods for
DNA and RNA isolation. Such nucleic acid isolation methods are commonly performed
in the art. Where the nucleic acid is RNA, the RNA can be reverse transcribed to DNA
using primers known in the art. Where the DNA is genomic DNA, the DNA can be
sheared using, for example, a 25 gauge needle.

The nucleic acids are then cloned into an appropriate vector. The vector used
will depend upon whether the DNA is to be expressed, amplified, sequenced or
manipulated in any number of ways known in the art (see, for example, U.S. Patent No.
6,022,716 which discloses high throughput sequencing vectors). Cloning techniques are
known in the art or can be developed by one skilled in the art, without undue
experimentation. The choice of a vector will also depend on the size of the
polynucleotide sequence and the host cell to be employed in the methods of the
invention. Thus, the vector used in the invention may be plasmids, phages, cosmids,
phagemids, viruses (e.g., retroviruses, parainfluenzavirus, herpesviruses, reoviruses,
paramyxoviruses, and the like), or selected portions thereof (e.g., coat protein, spike
glycoprotein, capsid protein). For example, cosmids and phagemids are typically used
where the specific nucleic acid sequence to be analyzed or modified is large because
these vectors are able to stably propagate large polynucleotides.

The vector containing the cloned DNA sequence can then be amplified by
plating (i.e., clonal amplification) or transfecting a suitable host cell with the vector (e.g.,
a phage on an E. coli host). Alternatively (or subsequently to amplification), the cloned
DNA sequence is used to prepare a library for screening by transforming a suitable
organism. Hosts known in the art are transformed by artificial introduction of the
vectors containing the target nucleic acid by inoculation under conditions conducive for
such transformation. One could transform with double stranded circular or linear nucleic
acid or there may also be instances where one would transform with single stranded
circular or linear nucleic acid sequences. By transform or transformation is meant a
permanent or transient genetic change induced in a cell following incorporation of new
DNA (i.e., DNA exogenous to the cell). Where the cell is a mammalian cell, a
permanent genetic change is generally achieved by introduction of the DNA into the
genome of the cell. A transformed cell or host cell generally refers to a cell (e.g.,
prokaryotic or eukaryotic) into which (or into an ancestor of which) has been introduced,
by means of recombinant DNA techniques, a DNA molecule not normally present in the
host organism.

A particular type of vector for use in the invention contains an f-factor origin
of replication. The f-factor (or fertility factor) in E. coli is a plasmid which effects high
frequency transfer of itself during conjugation and less frequent transfer of the bacterial
chromosome itself. In a particular embodiment cloning vectors referred to as "fosmids"
or bacterial artificial chromosome (BAC) vectors are used. These are derived from the E.
coli f-factor, which is able to stably integrate large segments of DNA. When integrated
with DNA from a mixed uncultured environmental sample, this makes it possible to
achieve large genomic fragments in the form of a stable "environmental DNA library."

The nucleic acids derived from a mixed population or sample may be inserted
into the vector by a variety of procedures. In general, the nucleic acid sequence is
inserted into an appropriate restriction endonuclease site(s) by procedures known in the
art. Such procedures and others are deemed to be within the scope of those skilled in the
art. A typical cloning scenario may have the DNA "blunted" with an appropriate
nuclease (e.g., Mung Bean Nuclease), methylated with, for example, EcoR I Methylase
and ligated to EcoR I linkers GGAATTCC (SEQ ID NO: 1). The linkers are then
digested with an EcoR I Restriction Endonuclease and the DNA size fractionated (e.g.,
using a sucrose gradient). The resulting size fractionated DNA is then ligated into a
suitable vector for sequencing, screening or expression (e.g., a lambda vector, which is
then packaged using an in vitro lambda packaging extract).
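
By way of illustration only, the linker step of the cloning scenario above can be sketched in Python. The linker GGAATTCC and the EcoR I recognition sequence GAATTC (cut between G and A on the top strand) are as described or well known; the insert sequence, the function names and the single-strand representation are hypothetical simplifications, and the sketch assumes that any internal EcoR I sites in the insert have been protected by methylation and therefore does not model them.

```python
import re

ECORI_SITE = "GAATTC"   # EcoR I recognition sequence (cuts G^AATTC)
LINKER = "GGAATTCC"     # EcoR I linker from the text (SEQ ID NO: 1)


def find_sites(seq, site=ECORI_SITE):
    """Return 0-based start positions of every occurrence of site in seq."""
    return [m.start() for m in re.finditer(f"(?={site})", seq.upper())]


def add_linkers(insert, linker=LINKER):
    """Model ligation of one linker to each end of a blunt-ended insert."""
    return linker + insert.upper() + linker


def digest(seq, site=ECORI_SITE, cut_offset=1):
    """Cut the top strand after the G of each G^AATTC site.

    Simplification: only the top strand is modeled, and internal sites
    are assumed to be absent or methylation-protected.
    """
    fragments, previous = [], 0
    for position in find_sites(seq, site):
        cut = position + cut_offset
        fragments.append(seq[previous:cut])
        previous = cut
    fragments.append(seq[previous:])
    return fragments


if __name__ == "__main__":
    insert = "ATGCCCGGGTTT"            # hypothetical blunt-ended fragment
    linked = add_linkers(insert)
    print(linked)                      # linkered fragment
    print(find_sites(linked))          # one site inside each linker
    print(digest(linked))              # linker stubs plus sticky-ended insert
```

In this sketch the middle fragment returned by `digest` corresponds to the insert carrying EcoR I-compatible ends, which is the species that would be size fractionated and ligated into the vector.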

Transformation of a host cell with recombinant DNA may be carried out by
conventional techniques as are well known to those skilled in the art. Where the host is
prokaryotic, such as E. coli, competent cells which are capable of DNA uptake can be
prepared from cells harvested after exponential growth phase and subsequently treated
by the CaCl2 method by procedures well known in the art. Alternatively, MgCl2 or RbCl
can be used. Transformation can also be performed after forming a protoplast of the host
cell or by electroporation.

When the host is a eukaryote, methods of transfection or transformation with
DNA, including calcium phosphate co-precipitation, conventional mechanical procedures
such as microinjection, electroporation, insertion of a plasmid encased in liposomes, or
virus vectors, as well as others known in the art, may be used. Eukaryotic cells can also
be cotransfected with a second foreign DNA molecule encoding a selectable marker,
such as the herpes simplex thymidine kinase gene. Another method is to use a
eukaryotic viral vector, such as simian virus 40 (SV40) or bovine papilloma virus, to
transiently infect or transform eukaryotic cells and express the protein (Eukaryotic Viral
Vectors, Cold Spring Harbor Laboratory, Gluzman ed., 1982). The eukaryotic cell may
be a yeast cell (e.g., Saccharomyces cerevisiae), an insect cell (e.g., Drosophila sp.) or
may be a mammalian cell, including a human cell.

Eukaryotic systems, and mammalian expression systems, allow for
post-translational modifications of expressed mammalian proteins to occur. Eukaryotic cells
which possess the cellular machinery for processing of the primary transcript,
glycosylation, phosphorylation, and, advantageously, secretion of the gene product
should be used. Such host cell lines may include, but are not limited to, CHO, VERO,
BHK, HeLa, COS, MDCK, Jurkat, HEK-293, and WI38.

Biopanning

After the expression libraries have been generated one can perform "biopanning"
of the libraries prior to expression screening. The "biopanning" procedure refers to a
process for identifying clones having a specified biological activity by screening for
sequence homology in the library of clones, using at least one probe DNA comprising at
least a portion of a DNA sequence encoding a polypeptide having the specified
biological activity; and detecting interactions with the probe DNA to a substantially
complementary sequence in a clone. Clones (either viable or non-viable) are then
separated by a fluorescence analyzer (e.g., a FACS apparatus).

The probe DNA used to probe for the target DNA of interest contained in clones
prepared from polynucleotides in a mixed population of organisms can be a full-length
coding region sequence or a partial coding region sequence of DNA for a known
bioactivity. The sequence of the probe can be generated by synthetic or recombinant
means and can be based upon computer based sequencing programs or biological
sequences present in a clone. The DNA library can be probed using mixtures of probes
comprising at least a portion of the DNA sequence encoding a known bioactivity having
a desired activity. These probes or probe libraries are preferably single-stranded. The
probes that are particularly suitable are those derived from DNA encoding bioactivities
having an activity similar or identical to the specified bioactivity which is to be screened.

In another embodiment, the polynucleotides are contained in clones, the clones
having been prepared from nucleic acid sequences of a mixed population of organisms,
wherein the nucleic acid sequences are used to prepare a DNA library of the mixed
population of organisms. The DNA library is screened for a sequence of interest by
transfecting a host cell containing the library with at least one labeled nucleic acid
sequence which is all or a portion of a DNA sequence encoding a bioactivity having a
desirable activity and separating the library clones containing the desirable sequence by
fluorescence-based analysis.

In another embodiment, in vivo biopanning may be performed utilizing a
FACS-based machine. Complex gene libraries are constructed with vectors which
contain elements which stabilize transcribed RNA. For example, the inclusion of
sequences which result in secondary structures such as hairpins which are designed to
flank the transcribed regions of the RNA would serve to enhance their stability, thus
increasing their half life within the cell. The probe molecules used in the biopanning
process consist of oligonucleotides labeled with reporter molecules that only fluoresce
upon binding of the probe to a target molecule. Various dyes or stains well known in
the art, for example those described in "Practical Flow Cytometry", 1995 Wiley-Liss,
Inc., Howard M. Shapiro, M.D., can be used to intercalate or associate with nucleic
acid in order to "label" the oligonucleotides. These probes are introduced into the
recombinant cells of the library using one of several transformation methods. The
probe molecules interact or hybridize to the transcribed target mRNA or DNA
resulting in DNA/RNA heteroduplex molecules or DNA/DNA duplex molecules.
Binding of the probe to a target will yield a fluorescent signal which is detected and
sorted by the FACS machine during the screening process.
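
The sort decision described above can be pictured, purely as a sketch, as a threshold ("gate") applied to the fluorescence measured for each cell. The data structure, the function name, the arbitrary fluorescence units and the choice of a gate at three times background are all hypothetical and are not taken from the disclosure; an actual instrument applies gates configured by the operator.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One cell passing the detector (hypothetical representation)."""
    clone_id: str
    fluorescence: float  # arbitrary units


def sort_events(events, background, fold_over_background=3.0):
    """Split events into probe-positive and probe-negative fractions.

    Assumption: a cell is called positive when its signal is at least
    fold_over_background times the background signal.
    """
    threshold = background * fold_over_background
    positive = [e for e in events if e.fluorescence >= threshold]
    negative = [e for e in events if e.fluorescence < threshold]
    return positive, negative


if __name__ == "__main__":
    library = [Event("a", 10.0), Event("b", 250.0), Event("c", 31.0)]
    hits, rest = sort_events(library, background=10.0)
    print([e.clone_id for e in hits])   # clones collected for follow-up
    print([e.clone_id for e in rest])   # clones discarded
```

Because the reporter is described as fluorescing only upon binding to its target, unbound probe contributes little to the background, which is what makes a simple gate of this kind plausible.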

The probe DNA should be at least about 10 bases and preferably at least 15
bases. In one embodiment, an entire coding region of one part of a pathway may be
employed as a probe. Where the probe is hybridized to the target DNA in an in vitro
system, conditions for the hybridization in which target DNA is selectively isolated by
the use of at least one DNA probe will be designed to provide a hybridization stringency
of at least about 50% sequence identity, more particularly a stringency providing for a
sequence identity of at least about 70%. Hybridization techniques for probing a
microbial DNA library to isolate target DNA of potential interest are well known in the
art and any of those which are described in the literature are suitable for use herein.
Prior to fluorescence sorting the clones may be viable or non-viable. For example, in
one embodiment, the cells are fixed with paraformaldehyde prior to sorting.
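
As a computational analogy only, the probe length and sequence identity figures above can be expressed as a simple check. The sketch below slides a probe along a target without gaps and reports the best percent identity; it compares the probe with the same-sense strand, which is equivalent to asking how well the probe would pair with the complementary strand. The function names and example sequences are hypothetical, and actual hybridization stringency is set experimentally (e.g., by temperature and salt concentration) rather than by this calculation.

```python
def percent_identity(probe, window):
    """Percent of positions at which two equal-length sequences match."""
    if len(probe) != len(window):
        raise ValueError("probe and window must be the same length")
    matches = sum(p == w for p, w in zip(probe.upper(), window.upper()))
    return 100.0 * matches / len(probe)


def best_ungapped_match(probe, target):
    """Return (best percent identity, start position) over all windows."""
    best = (0.0, -1)
    for start in range(len(target) - len(probe) + 1):
        identity = percent_identity(probe, target[start:start + len(probe)])
        if identity > best[0]:
            best = (identity, start)
    return best


def probe_selects(probe, target, min_length=10, min_identity=50.0):
    """Apply the length and identity thresholds mentioned in the text."""
    if len(probe) < min_length:
        return False
    identity, _ = best_ungapped_match(probe, target)
    return identity >= min_identity


if __name__ == "__main__":
    probe = "ATGGCGTACGTTAGC"               # hypothetical 15-base probe
    target = "CCCATGGCGTACGTTAGCAAA"        # hypothetical cloned sequence
    print(best_ungapped_match(probe, target))
    print(probe_selects(probe, target, min_length=15, min_identity=70.0))
```

The defaults correspond to the broader thresholds (about 10 bases, about 50% identity); the usage example passes the narrower ones (15 bases, 70% identity).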

Once viable or non-viable clones containing a sequence substantially
complementary to the probe DNA are separated by a fluorescence analyzer,
polynucleotides present in the separated clones may be further manipulated. In some
instances, it may be desirable to perform an amplification of the target DNA that has
been isolated. In this embodiment, the target DNA is separated from the probe DNA
after isolation. In one embodiment, the clone can be grown to expand the clonal
population. Alternatively, the host cell is lysed and the target DNA is amplified
before being used to transform a new host (e.g., subcloning). Long PCR
(Barnes, W. M., Proc. Natl. Acad. Sci. USA, Mar. 15, 1994) can be used to amplify large
DNA fragments (e.g., 35 kb). Numerous amplification methodologies are now well
known in the art.

Where the target DNA is identified in vitro, the selected DNA is then used for
preparing a library for further processing and screening by transforming a suitable
organism. Hosts, particularly those specifically identified herein as preferred, are
transformed by artificial introduction of a vector containing a target DNA by inoculation
under conditions conducive for such transformation.

The resultant libraries (enriched for a polynucleotide of interest) can then be 
screened for clones which display an activity of interest. Clones can be shuttled in 
alternative hosts for expression of active compounds, or screened using methods 
described herein. 

Having prepared a multiplicity of clones from DNA selectively isolated via
FACS-based hybridization, such clones are screened for a specific activity to identify
clones having a specified characteristic.

The screening for activity may be effected on individual expression clones or 
may be initially effected on a mixture of expression clones to ascertain whether or not 
the mixture has one or more specified activities. If the mixture has a specified activity,
then the individual clones may be rescreened for such activity or for a more specific 
activity. 

Prior to, subsequent to or as an alternative to the in vivo biopanning described
above, an encapsulation technique such as GMDs may be employed to localize
at least one clone in one location for growth or screening by a fluorescent analyzer (e.g.,
FACS). The separated at least one clone contained in the GMD may then be cultured to
expand the number of clones or screened on a FACS machine to identify clones
containing a sequence of interest as described above, which can then be broken out into
individual clones to be screened again on a FACS machine to identify positive individual
clones. Screening in this manner using a FACS machine is described in patent
application Ser. No. 08/876,276, filed June 16, 1997. Thus, for example, if a mixture of clones has a
desirable activity, then the individual clones may be recovered and rescreened utilizing a
FACS machine to determine which of such clones has the specified desirable activity.

Further, it is possible to combine some or all of the above embodiments such
that a normalization step is performed prior to generation of the expression library, the
expression library is then generated, the expression library so generated is then
biopanned, and the biopanned expression library is then screened using a high
throughput cell sorting and screening instrument. Thus there are a variety of options,
including: (i) generate the library and then screen it; (ii) normalize the target
DNA, generate the expression library and screen it; (iii) normalize, generate the
library, biopan and screen; or (iv) generate, biopan and screen the library.
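
The four options above differ only in whether the optional normalization and biopanning steps are included. A minimal sketch of that ordering follows; the function name `screening_workflow` and the step labels are hypothetical and serve only to make the combinations explicit.

```python
from typing import Dict, List


def screening_workflow(normalize: bool = False, biopan: bool = False) -> List[str]:
    """Return the ordered steps for one of the workflow options (i)-(iv)."""
    steps: List[str] = []
    if normalize:
        steps.append("normalize target DNA")      # optional, before library generation
    steps.append("generate expression library")   # always performed
    if biopan:
        steps.append("biopan library")             # optional, after library generation
    steps.append("screen library")                 # always performed last
    return steps


options: Dict[str, List[str]] = {
    "i": screening_workflow(),
    "ii": screening_workflow(normalize=True),
    "iii": screening_workflow(normalize=True, biopan=True),
    "iv": screening_workflow(biopan=True),
}
```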

The library may, for example, be screened for a specified enzyme activity.
For example, the enzyme activity screened for may be one or more of the six IUB
classes: oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases. The
recombinant enzymes which are determined to be positive for one or more of the IUB
classes may then be rescreened for a more specific enzyme activity.

Alternatively, the library may be screened for a more specialized enzyme
activity. For example, instead of generically screening for hydrolase activity, the
library may be screened for a more specialized activity, i.e., the type of bond on which
the hydrolase acts. Thus, for example, the library may be screened to ascertain those
hydrolases which act on one or more specified chemical functionalities, such as: (a)
amide (peptide bonds), i.e., proteases; (b) ester bonds, i.e., esterases and lipases; (c)
acetals, i.e., glycosidases, etc.

As described with respect to one of the above aspects, the invention provides a
process for activity screening of clones containing selected DNA derived from a mixed
population of organisms or more than one organism. The process comprises
biopanning polynucleotides from a mixed population of organisms by separating
the clones or polynucleotides positive for a sequence of interest with a fluorescent
analyzer that detects fluorescence, to select polynucleotides or clones containing
polynucleotides positive for a sequence of interest, and screening the selected clones or
polynucleotides for a specified bioactivity. In one embodiment, the polynucleotides are
contained in clones having been prepared by recovering DNA of a microorganism,
which DNA is selected by hybridization to at least one DNA sequence which is all or a
portion of a DNA sequence encoding a bioactivity having a desirable activity.

In another embodiment, a DNA library derived from a microorganism is 
subjected to a selection procedure to select therefrom DNA which hybridizes to one or 
more probe DNA sequences which is all or a portion of a DNA sequence encoding an 
activity having a desirable activity by: 

(a) contacting a DNA library with a fluorescent labeled DNA probe under 
conditions permissive of hybridization so as to produce a double-stranded complex of 
probe and members of the DNA library. 

Screening 

The present invention offers the ability to screen for many types of bioactivities. 
For instance, the ability to select and combine desired components from a library of 
polyketides and postpolyketide biosynthesis genes for generation of novel polyketides 
for study is appealing. The method(s) of the present invention make possible and
facilitate the cloning of novel polyketide synthases, and other relevant pathways or genes
encoding commercially relevant secondary metabolites, since one can generate gene
banks with clones containing large inserts (especially when using vectors which can
accept large inserts, such as the F-factor based vectors), which facilitates cloning of gene
clusters.

The biopanning approach described above can be used to create libraries
enriched with clones carrying sequences substantially homologous to a given probe
sequence. Using this approach, libraries containing clones with inserts of up to 40 kbp
can be enriched approximately 1,000 fold after each round of panning. This enables one
to reduce the number of clones to be screened after 1 round of biopanning enrichment.
This approach can be applied to create libraries enriched for clones carrying sequences of
interest related to a bioactivity of interest, for example, polyketide sequences.
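
The effect of the approximately 1,000 fold enrichment per round on screening effort can be worked through with simple arithmetic. The sketch below assumes an idealized, constant fold enrichment per round and a hypothetical starting frequency of one positive clone per million; the function names are illustrative only.

```python
import math


def frequency_after_panning(initial_frequency: float, rounds: int,
                            fold_per_round: float = 1000.0) -> float:
    """Fraction of positive clones after a number of panning rounds, capped at 1.0."""
    if not 0.0 < initial_frequency <= 1.0:
        raise ValueError("initial_frequency must be in (0, 1]")
    return min(1.0, initial_frequency * fold_per_round ** rounds)


def clones_per_positive(frequency: float) -> float:
    """Expected number of clones that must be screened to encounter one positive."""
    return 1.0 / frequency


start = 1e-6                                   # hypothetical: 1 positive per 1,000,000 clones
after_one = frequency_after_panning(start, 1)  # about 1 positive per 1,000 clones
after_two = frequency_after_panning(start, 2)  # capped at 1.0 under this idealized model
```

Under these assumptions, a single round reduces the expected screening effort from about one million clones per positive to about one thousand.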

Hybridization screening using high density filters or biopanning has proven an
efficient approach to detect homologues of pathways containing genes of interest to
discover novel bioactive molecules that may have no known counterparts. Once a
polynucleotide of interest is enriched in a library of clones it may be desirable to screen
for an activity. For example, it may be desirable to screen for the expression of small
molecule ring structures or "backbones". Because the genes encoding these polycyclic
structures can often be expressed in E. coli, the small molecule backbone can be
manufactured, albeit in an inactive form. Bioactivity is conferred upon transferring the
molecule or pathway to an appropriate host that expresses the requisite glycosylation and
methylation genes that can modify or "decorate" the structure to its active form. Thus,
inactive ring compounds recombinantly expressed in E. coli are detected to identify
clones which are then shuttled to a metabolically rich host, such as Streptomyces, for
subsequent production of the bioactive molecule. The use of high throughput robotic
systems allows the screening of hundreds of thousands of clones in multiplexed arrays in
microtiter dishes.

One approach to detect and enrich for clones carrying these structures is to use
FACS screening, a procedure described and exemplified in U.S. Ser. No. 08/876,276,
filed June 16, 1997. Polycyclic ring compounds typically have characteristic fluorescent
spectra when excited by ultraviolet light. Thus, clones expressing these structures can be
distinguished from background using a sufficiently sensitive detection method. High
throughput FACS screening can be utilized to screen for small molecule backbones in E.
coli libraries. Commercially available FACS machines are capable of screening up to
100,000 clones per second for UV active molecules. These clones can be sorted for
further FACS screening or the resident plasmids can be extracted and shuttled to
Streptomyces for activity screening.
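
To put the stated throughput in perspective, the following sketch estimates the minimum time needed to pass a library through the sorter. It assumes the instrument runs continuously at the stated upper rate of 100,000 clones per second, so the result is a lower bound; the library size used is hypothetical.

```python
def sort_time_seconds(n_clones: int, rate_per_second: float = 100_000.0) -> float:
    """Lower-bound time to analyze n_clones at a constant event rate."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    return n_clones / rate_per_second


library_size = 100_000_000                 # hypothetical library of 10^8 clones
seconds = sort_time_seconds(library_size)  # 1000 seconds at the stated maximum rate
minutes = seconds / 60.0                   # roughly 17 minutes
```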

In an alternate screening approach, after shuttling to Streptomyces hosts, organic
extracts from candidate clones can be tested for bioactivity by susceptibility screening
against test organisms such as Staphylococcus aureus, E. coli, or Saccharomyces
cerevisiae. FACS screening can be used in this approach by co-encapsulating clones with
the test organism.

An alternative to the above-mentioned screening methods provided by the
present invention is an approach termed "mixed extract" screening. The "mixed extract"
screening approach takes advantage of the fact that the accessory genes needed to confer
activity upon the polycyclic backbones are expressed in metabolically rich hosts, such as
Streptomyces, and that the enzymes can be extracted and combined with the backbones
extracted from E. coli clones to produce the bioactive compound in vitro. Enzyme
extract preparations from metabolically rich hosts, such as Streptomyces strains, at
various growth stages are combined with pools of organic extracts from E. coli libraries
and then evaluated for bioactivity.

Another approach to detect activity in the E. coli clones is to screen for genes 
that can convert bioactive compounds to different forms. For example, a recombinant 
enzyme was recently discovered that can convert the low value daunomycin to the
higher value doxorubicin. Similar enzyme pathways are being sought to convert 
penicillins to cephalosporins. 

Screening may be carried out to detect a specified enzyme activity by procedures
known in the art. For example, enzyme activity may be screened for one or more of the
six IUB classes: oxidoreductases, transferases, hydrolases, lyases, isomerases and
ligases. The recombinant enzymes which are determined to be positive for one or more
of the IUB classes may then be rescreened for a more specific enzyme activity.
Alternatively, the library may be screened for a more specialized enzyme activity. For
example, instead of generically screening for hydrolase activity, the library may be
screened for a more specialized activity, i.e., the type of bond on which the hydrolase
acts. Thus, for example, the library may be screened to ascertain those hydrolases which
act on one or more specified chemical functionalities, such as: (a) amide (peptide bonds),
i.e., proteases; (b) ester bonds, i.e., esterases and lipases; (c) acetals, i.e., glycosidases.

FACS screening can also be used to detect expression of UV fluorescent
molecules in metabolically rich hosts, such as Streptomyces. Recombinant oxytetracycline
retains its diagnostic red fluorescence when produced heterologously in S. lividans
TK24. Pathway clones, which can be sorted by FACS, can thus be screened for
polycyclic molecules in a high throughput fashion.

Recombinant bioactive compounds can also be screened in vivo using "two-hybrid"
systems, which can detect enhancers and inhibitors of protein-protein or other
interactions such as those between transcription factors and their activators, or receptors
and their cognate targets. In this embodiment, both the small molecule pathway and the
GFP reporter construct are co-expressed. Clones altered in GFP expression can then be
sorted by FACS and the pathway clone isolated for characterization.

As indicated, common approaches to drug discovery involve screening assays in
which disease targets (macromolecules implicated in causing a disease) are exposed to
potential drug candidates which are tested for therapeutic activity. In other approaches,
whole cells or organisms that are representative of the causative agent of the disease,
such as bacteria or tumor cell lines, are exposed to the potential candidates for screening
purposes. Any of these approaches can be employed with the present invention.

The present invention also allows for the transfer of cloned pathways derived 
from uncultivated samples into metabolically rich hosts for heterologous expression and 
downstream screening for bioactive compounds of interest using a variety of screening 
approaches briefly described above. 

Recovering Desirable Bioactivities

After viable or non-viable cells, each containing a different expression clone
from the gene library, are screened and positive clones are recovered, DNA can be
isolated from positive clones utilizing techniques well known in the art. The DNA can
then be amplified either in vivo or in vitro by utilizing any of the various amplification
techniques known in the art. In vivo amplification would include transformation of the
clone(s) or subclone(s) into a viable host, followed by growth of the host. In vitro
amplification can be performed using techniques such as the polymerase chain reaction.
Once amplified, the identified sequences can be "evolved" or sequenced.

Evolution

One advantage afforded by the present invention is the ability to manipulate the
identified polynucleotides to generate and select for encoded variants with altered
activity or specificity.



Clones found to have the bioactivity for which the screen was performed can be
subjected to directed mutagenesis to develop new bioactivities with desired properties or
to develop modified bioactivities with particularly desired properties that are absent or
less pronounced in the wild-type activity, such as stability to heat or organic solvents.
Any of the known techniques for directed mutagenesis are applicable to the invention.
For example, particularly preferred mutagenesis techniques for use in accordance with
the invention include those described below.

Alternatively, it may be desirable to variegate a polynucleotide sequence
obtained, identified or cloned as described herein. Such variegation can modify the
polynucleotide sequence in order to modify (e.g., increase or decrease) the encoded
polypeptide's activity, specificity, affinity, function, etc. DNA shuffling can be used to
increase variation in a particular sample. DNA shuffling is meant to indicate
recombination between substantially homologous but non-identical sequences. In some
embodiments DNA shuffling may involve crossover via non-homologous
recombination, such as via cre/lox and/or flp/frt systems and the like (see, for example,
U.S. Patent No. 5,939,250, issued to Dr. Jay Short on August 17, 1999, and assigned to
Diversa Corporation, the disclosure of which is incorporated herein by reference).
Various methods for shuffling, mutating or variegating polynucleotide sequences are
discussed below.

Nucleic acid shuffling is a method for in vitro or in vivo homologous
recombination of pools of shorter or smaller polynucleotides to produce a
polynucleotide or polynucleotides. Mixtures of related nucleic acid sequences or
polynucleotides are subjected to sexual PCR to provide random polynucleotides, and
reassembled to yield a library or mixed population of recombinant hybrid nucleic acid
molecules or polynucleotides.

In contrast to cassette mutagenesis, only shuffling and error-prone PCR allow
one to mutate a pool of sequences blindly (without sequence information other than
primers).

The advantage of the mutagenic shuffling of the invention over error-prone
PCR alone for repeated selection can best be explained as follows. Consider DNA
shuffling as compared with error-prone PCR (not sexual PCR). The initial library of
selected pooled sequences can consist of related sequences of diverse origin or can be
derived by any type of mutagenesis (including shuffling) of a single gene. A
collection of selected sequences is obtained after the first round of activity selection.
Shuffling allows the free combinatorial association of all of the related sequences, for
example.

This method differs from error-prone PCR, in that it is an inverse chain
reaction. In error-prone PCR, the number of polymerase start sites and the number of
molecules grows exponentially. However, the sequence of the polymerase start sites
and the sequence of the molecules remains essentially the same. In contrast, in
nucleic acid reassembly or shuffling of random polynucleotides the number of start
sites and the number (but not size) of the random polynucleotides decreases over time.
For polynucleotides derived from whole plasmids the theoretical endpoint is a single,
large concatemeric molecule.
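
The contrast between the two reactions can be pictured with a deliberately simplified counting model. The sketch below is not a kinetic model: it assumes ideal doubling of molecules per cycle for error-prone PCR and ideal pairwise joining of polynucleotides per cycle for reassembly. Both assumptions, and the function names, are hypothetical and chosen only to show growth versus decrease toward a single molecule.

```python
import math
from typing import List


def pcr_molecule_counts(start: int, cycles: int) -> List[int]:
    """Molecule counts per cycle assuming ideal doubling (exponential growth)."""
    return [start * 2 ** c for c in range(cycles + 1)]


def reassembly_counts(start: int, cycles: int) -> List[int]:
    """Polynucleotide counts per cycle assuming every pair joins into one molecule."""
    counts = [start]
    for _ in range(cycles):
        # Pairs merge, so the count roughly halves and never drops below one.
        counts.append(max(1, math.ceil(counts[-1] / 2)))
    return counts


growth = pcr_molecule_counts(1, 3)   # counts increase each cycle
decline = reassembly_counts(8, 4)    # counts decrease toward a single molecule
```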

Since cross-overs occur at regions of homology, recombination will primarily
occur between members of the same sequence family. This discourages combinations
of sequences that are grossly incompatible (e.g., having different activities or
specificities). It is contemplated that multiple families of sequences can be shuffled in
the same reaction. Further, shuffling generally conserves the relative order.
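
A toy illustration of crossover restricted to regions of homology is sketched below. It assumes two parent sequences of equal length that are already in register, and treats any window of `k` identical bases as a permissible crossover site. The window size `k`, the function names and the parent sequences are hypothetical, and a single crossover between two parents is far simpler than the multi-fragment reassembly described in the text.

```python
import random
from typing import List, Optional


def homology_sites(a: str, b: str, k: int = 4) -> List[int]:
    """Cut positions located at the end of each window of k bases shared by a and b."""
    if len(a) != len(b):
        raise ValueError("parents must have equal length")
    return [i + k for i in range(len(a) - k + 1) if a[i:i + k] == b[i:i + k]]


def shuffle_pair(a: str, b: str, k: int = 4,
                 rng: Optional[random.Random] = None) -> str:
    """Single crossover at a shared window: prefix from a, suffix from b."""
    rng = rng or random.Random(0)
    sites = homology_sites(a, b, k)
    if not sites:
        return a  # no shared homology, so no crossover occurs
    cut = rng.choice(sites)
    return a[:cut] + b[cut:]  # relative order of positions is conserved


parent_a = "ATGCCGTTAGGCTAAC"  # hypothetical parent sequences
parent_b = "ATGACGTTAGCCTAAC"
sites = homology_sites(parent_a, parent_b)
child = shuffle_pair(parent_a, parent_b, rng=random.Random(1))
```

Because the child is built from a prefix of one parent and a suffix of the other, every position in the child comes from the same position in one of the parents.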

Rare shufflants will contain a large number of the best molecules (e.g., highest
activity or specificity) and these rare shufflants may be selected based on their
superior activity or specificity.

A pool of 100 different polypeptide sequences can be permutated in up to 10^3
different ways. This large number of permutations cannot be represented in a single
library of DNA sequences. Accordingly, it is contemplated that multiple cycles of
DNA shuffling and selection may be required depending on the length of the
sequence and the sequence diversity desired.

Error-prone PCR, in contrast, keeps all the selected sequences in the same 
relative orientation, generating a much smaller mutant cloud. 

The template polynucleotide which may be used in the methods of the
invention may be DNA or RNA. It may be of various lengths depending on the size
of the gene or shorter or smaller polynucleotide to be recombined or reassembled.
Preferably, the template polynucleotide is from 50 bp to 50 kb. It is contemplated that
entire vectors containing the nucleic acid encoding the protein of interest can be used
in the methods of the invention, and in fact have been successfully used.

The template polynucleotide may be obtained by amplification using the PCR 
reaction (USPN 4,683,202 and USPN 4,683,195) or other amplification or cloning 
methods. However, the removal of free primers from the PCR products before 
subjecting them to pooling of the PCR products and sexual PCR may provide more
efficient results. Failure to adequately remove the primers from the original pool 
before sexual PCR can lead to a low frequency of crossover clones. 

The template polynucleotide often is double-stranded. A double-stranded 
nucleic acid molecule is recommended to ensure that regions of the resulting
single-stranded polynucleotides are complementary to each other and thus can 
hybridize to form a double-stranded molecule. 

It is contemplated that single-stranded or double-stranded nucleic acid
polynucleotides having regions of identity to the template polynucleotide and regions
of heterology to the template polynucleotide may be added to the template
polynucleotide at this step. It is also contemplated that two different but related
polynucleotide templates can be mixed at this step.

The double-stranded polynucleotide template and any added double- or
single-stranded polynucleotides are subjected to sexual PCR which includes slowing
or halting to provide a mixture of polynucleotides of from about 5 bp to 5 kb or more.
Preferably the size of the random polynucleotides is from about 10 bp to 1000 bp, more
preferably the size of the polynucleotides is from about 20 bp to 500 bp.

Alternatively, it is also contemplated that double-stranded nucleic acid having
multiple nicks may be used in the methods of the invention. A nick is a break in one
strand of the double-stranded nucleic acid. The distance between such nicks is
preferably 5 bp to 5 kb, more preferably between 10 bp to 1000 bp. This can provide
areas of self-priming to produce shorter or smaller polynucleotides to be included
with the polynucleotides resulting from random primers, for example.

The concentration of any one specific polynucleotide will not be greater than 
1% by weight of the total polynucleotides, more preferably the concentration of any 
one specific nucleic acid sequence will not be greater than 0.1% by weight of the total 
nucleic acid.

The number of different specific polynucleotides in the mixture will be at least 
about 100, preferably at least about 500, and more preferably at least about 1000. 
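
The two composition limits above (no single polynucleotide above 1% by weight, and at least about 100 different polynucleotides) can be checked together. The sketch below assumes the mixture is described by a mapping from a polynucleotide label to its weight in arbitrary units; the function name `check_mixture` and the example mixtures are hypothetical.

```python
from typing import Dict


def check_mixture(weights: Dict[str, float], max_fraction: float = 0.01,
                  min_species: int = 100) -> bool:
    """True if no species exceeds max_fraction by weight and enough species are present."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    largest_fraction = max(weights.values()) / total
    return largest_fraction <= max_fraction and len(weights) >= min_species


# Hypothetical mixtures of equally abundant polynucleotides.
diverse = {f"frag{i}": 1.0 for i in range(200)}  # each species is 0.5% by weight
narrow = {f"frag{i}": 1.0 for i in range(50)}    # each species is 2% by weight

diverse_ok = check_mixture(diverse)
narrow_ok = check_mixture(narrow)
```

The stricter preferred limits in the text (0.1% by weight, 500 or 1000 species) can be applied by passing different values for `max_fraction` and `min_species`.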

At this step single-stranded or double-stranded polynucleotides, either
synthetic or natural, may be added to the random double-stranded shorter or smaller
polynucleotides in order to increase the heterogeneity of the mixture of
polynucleotides.

It is also contemplated that populations of double-stranded randomly broken
polynucleotides may be mixed or combined at this step with the polynucleotides from
the sexual PCR process and optionally subjected to one or more additional sexual
PCR cycles.

Where insertion of mutations into the template polynucleotide is desired, 
single-stranded or double-stranded polynucleotides having a region of identity to the
template polynucleotide and a region of heterology to the template polynucleotide 
may be added in a 20 fold excess by weight as compared to the total nucleic acid, 
more preferably the single-stranded polynucleotides may be added in a 10 fold excess 
by weight as compared to the total nucleic acid. 

Where a mixture of different but related template polynucleotides is desired,
populations of polynucleotides from each of the templates may be combined at a ratio
of less than about 1:100, more preferably the ratio is less than about 1:40. For
example, a backcross of the wild-type polynucleotide with a population of mutated
polynucleotide may be desired to eliminate neutral mutations (e.g., mutations yielding
an insubstantial alteration in the phenotypic property being selected for). In such an
example, the ratio of randomly provided wild-type polynucleotides which may be
added to the randomly provided sexual PCR cycle hybrid polynucleotides is
approximately 1:1 to about 100:1, and more preferably from 1:1 to 40:1.

The mixed population of random polynucleotides is denatured to form
single-stranded polynucleotides and then re-annealed. Only those single-stranded 
polynucleotides having regions of homology with other single-stranded 
polynucleotides will re-anneal. 

The random polynucleotides may be denatured by heating. One skilled in the
art could determine the conditions necessary to completely denature the double-stranded
nucleic acid. Preferably the temperature is from 80 °C to 100 °C, more
preferably the temperature is from 90 °C to 96 °C. Other methods which may be used
to denature the polynucleotides include pressure and pH.

The polynucleotides may be re-annealed by cooling. Preferably the
temperature is from 20 °C to 75 °C, more preferably the temperature is from 40 °C to
65 °C. If a high frequency of crossovers is needed based on an average of only 4
consecutive bases of homology, recombination can be forced by using a low
annealing temperature, although the process becomes more difficult. The degree of
renaturation which occurs will depend on the degree of homology between the
population of single-stranded polynucleotides.

Renaturation can be accelerated by the addition of polyethylene glycol
("PEG") or salt. The salt concentration is preferably from 0 mM to 200 mM, more
preferably the salt concentration is from 10 mM to 100 mM. The salt may be KCl or
NaCl. The concentration of PEG is preferably from 0% to 20%, more preferably from
5% to 10%.

The annealed polynucleotides are next incubated in the presence of a nucleic
acid polymerase and dNTPs (i.e., dATP, dCTP, dGTP and dTTP). The nucleic acid
polymerase may be the Klenow fragment, the Taq polymerase or any other DNA
polymerase known in the art.

The approach to be used for the assembly depends on the minimum degree of
homology that should still yield crossovers. If the areas of identity are large, Taq
polymerase can be used with an annealing temperature of between 45-65 °C. If the
areas of identity are small, Klenow polymerase can be used with an annealing
temperature of between 20-30 °C. One skilled in the art could vary the temperature of
annealing to increase the number of cross-overs achieved.
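
The choice described above can be written as a small lookup. Since the text does not give a numeric boundary between "large" and "small" areas of identity, the sketch takes that judgment as a boolean input; the function name `suggest_assembly_conditions` is hypothetical, and the temperature windows are the ones stated in the paragraph.

```python
from typing import Tuple


def suggest_assembly_conditions(large_identity: bool) -> Tuple[str, Tuple[int, int]]:
    """Return (polymerase, annealing temperature range in degrees C)."""
    if large_identity:
        # Large areas of identity: Taq polymerase, 45-65 degrees C.
        return ("Taq polymerase", (45, 65))
    # Small areas of identity: Klenow polymerase, 20-30 degrees C.
    return ("Klenow polymerase", (20, 30))


polymerase, (t_low, t_high) = suggest_assembly_conditions(large_identity=False)
```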

The polymerase may be added to the random polynucleotides prior to 
annealing, simultaneously with annealing or after annealing. 

The cycle of denaturation, renaturation and incubation in the presence of
polymerase is referred to herein as shuffling or reassembly of the nucleic acid. This
cycle is repeated for a desired number of times. Preferably the cycle is repeated from
2 to 50 times, more preferably the sequence is repeated from 10 to 40 times.

The resulting nucleic acid is a larger double-stranded polynucleotide of from 
about 50 bp to about 100 kb, preferably the larger polynucleotide is from 500 bp to 50
kb. 

This larger polynucleotide may contain a number of copies of a
polynucleotide having the same size as the template polynucleotide in tandem. This
concatemeric polynucleotide is then digested into single copies of the template
polynucleotide. The result will be a population of polynucleotides of approximately
the same size as the template polynucleotide. The population will be a mixed
population where single or double-stranded polynucleotides having an area of identity
and an area of heterology have been added to the template polynucleotide prior to
shuffling. These polynucleotides are then cloned into the appropriate vector and the
ligation mixture used to transform bacteria.

It is contemplated that the single polynucleotides may be obtained from the 
larger concatemeric polynucleotide by amplification of the single polynucleotide prior 
to cloning by a variety of methods including PCR (USPN 4,683,195 and USPN
4,683,202), rather than by digestion of the concatemer. 

The vector used for cloning is not critical provided that it will accept a 
polynucleotide of the desired size. If expression of the particular polynucleotide is 
desired, the cloning vehicle should further comprise transcription and translation
signals next to the site of insertion of the polynucleotide to allow expression of the 
polynucleotide in the host cell. 

The resulting bacterial population will include a number of recombinant
polynucleotides having random mutations. This mixed population may be tested to
identify the desired recombinant polynucleotides. The method of selection will
depend on the polynucleotide desired.

For example, if a polynucleotide, identified by the methods described
herein, encodes a protein with a first binding affinity, subsequent mutated (e.g.,
shuffled) sequences having an increased binding efficiency to a ligand may be
desired. In such a case the proteins expressed by each of the portions of the
polynucleotides in the population or library may be tested for their ability to bind to
the ligand by methods known in the art (e.g., panning, affinity chromatography). If a
polynucleotide which encodes for a protein with increased drug resistance is desired,
the proteins expressed by each of the polynucleotides in the population or library may
be tested for their ability to confer drug resistance to the host organism. One skilled
in the art, given knowledge of the desired protein, could readily test the population to
identify polynucleotides which confer the desired properties onto the protein.

It is contemplated that one skilled in the art could use a phage display system
in which fragments of the protein are expressed as fusion proteins on the phage
surface (Pharmacia, Milwaukee WI). The recombinant DNA molecules are cloned
into the phage DNA at a site which results in the transcription of a fusion protein, a
portion of which is encoded by the recombinant DNA molecule. The phage
containing the recombinant nucleic acid molecule undergoes replication and
transcription in the cell. The leader sequence of the fusion protein directs the
transport of the fusion protein to the tip of the phage particle. Thus the fusion protein
which is partially encoded by the recombinant DNA molecule is displayed on the
phage particle for detection and selection by the methods described above.

It is further contemplated that a number of cycles of nucleic acid shuffling
may be conducted with polynucleotides from a sub-population of the first population,
which sub-population contains DNA encoding the desired recombinant protein. In
this manner, proteins with even higher binding affinities or enzymatic activity could
be achieved.

It is also contemplated that a number of cycles of nucleic acid shuffling may
be conducted with a mixture of wild-type polynucleotides and a sub-population of
nucleic acid from the first or subsequent rounds of nucleic acid shuffling in order to
remove any silent mutations from the sub-population.

Any source of nucleic acid, in a purified form, can be utilized as the starting
nucleic acid. Thus the process may employ DNA or RNA including messenger RNA,
which DNA or RNA may be single or double stranded. In addition, a DNA-RNA
hybrid which contains one strand of each may be utilized. The nucleic acid sequence
may be of various lengths depending on the size of the nucleic acid sequence to be
mutated. Preferably the specific nucleic acid sequence is from 50 to 50000 base pairs.
It is contemplated that entire vectors containing the nucleic acid encoding the protein
of interest may be used in the methods of the invention.

Any specific nucleic acid sequence can be used to produce the population of 
hybrids by the present process. It is only necessary that a small population of hybrid 
sequences of the specific nucleic acid sequence exist or be available for the present
process. 

A population of specific nucleic acid sequences having mutations may be
created by a number of different methods. Mutations may be created by error-prone
PCR. Error-prone PCR uses low-fidelity polymerization conditions to introduce a
low level of point mutations randomly over a long sequence. Alternatively, mutations
can be introduced into the template polynucleotide by oligonucleotide-directed
mutagenesis. In oligonucleotide-directed mutagenesis, a short sequence of the
polynucleotide is removed from the polynucleotide using restriction enzyme digestion
and is replaced with a synthetic polynucleotide in which various bases have been
altered from the original sequence. The polynucleotide sequence can also be altered
by chemical mutagenesis. Chemical mutagens include, for example, sodium bisulfite,
nitrous acid, hydroxylamine, hydrazine or formic acid. Other agents, which are
analogues of nucleotide precursors, include nitrosoguanidine, 5-bromouracil,
2-aminopurine, or acridine. Generally, these agents are added to the PCR reaction in
place of the nucleotide precursor thereby mutating the sequence. Intercalating agents
such as proflavine, acriflavine, quinacrine and the like can also be used. Random
mutagenesis of the polynucleotide sequence can also be achieved by irradiation with
X-rays or ultraviolet light. Generally, plasmid polynucleotides so mutagenized are
introduced into E. coli and propagated as a pool or library of hybrid plasmids.
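
By way of illustration only, the low level of random point mutations attributed above to error-prone PCR can be modeled in a short Python sketch. The uniform per-base substitution rate, the function names and the toy template are assumptions made for this example and are not taken from the disclosure; actual error-prone PCR shows base-specific biases that this model ignores.

```python
import random

BASES = "ACGT"


def point_mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each base, with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Choose among the three bases that differ from the original.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_library(template: str, size: int, rate: float, seed: int = 0) -> list:
    """Build a pool of `size` mutated copies of `template` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [point_mutate(template, rate, rng) for _ in range(size)]


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


template = "ACGT" * 25                      # hypothetical 100-base template
library = make_library(template, size=10, rate=0.01)
for variant in library:
    print(hamming(template, variant), "point mutation(s)")
```

The resulting pool corresponds, in this simplified model, to the library of hybrid plasmids that is subsequently propagated and screened.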

Alternatively, a small mixed population of specific nucleic acids may be found
in nature in that they may consist of different alleles of the same gene or the same
gene from different related species (i.e., cognate genes). Alternatively, they may be
related DNA sequences found within one species, for example, the immunoglobulin
genes.

Once a mixed population of specific nucleic acid sequences is generated, the 
polynucleotides can be used directly or inserted into an appropriate cloning vector, 
using techniques well-known in the art. 

The choice of vector depends on the size of the polynucleotide sequence and
the host cell to be employed in the methods of the invention. The templates of the
invention may be plasmids, phages, cosmids, phagemids, viruses (e.g., retroviruses,
parainfluenzavirus, herpesviruses, reoviruses, paramyxoviruses, and the like), or
selected portions thereof (e.g., coat protein, spike glycoprotein, capsid protein). For
example, cosmids and phagemids are preferred where the specific nucleic acid
sequence to be mutated is larger because these vectors are able to stably propagate
large polynucleotides.

If a mixed population of the specific nucleic acid sequence is cloned into a
vector, it can be clonally amplified. Utility can be readily determined by screening
expressed polypeptides.

The DNA shuffling method of the invention can be performed blindly on a
pool of unknown sequences. By adding to the reassembly mixture oligonucleotides
(with ends that are homologous to the sequences being reassembled) any sequence
mixture can be incorporated at any specific position into another sequence mixture.
Thus, it is contemplated that mixtures of synthetic oligonucleotides, PCR
polynucleotides or even whole genes can be mixed into another sequence library at
defined positions. The insertion of one sequence (mixture) is independent from the
insertion of a sequence in another part of the template. Thus, the degree of
recombination, the homology required, and the diversity of the library can be
independently and simultaneously varied along the length of the reassembled DNA.

Shuffling requires the presence of homologous regions separating regions of
diversity. Scaffold-like protein structures may be particularly suitable for shuffling.
The conserved scaffold determines the overall folding by self-association, while
displaying relatively unrestricted loops that mediate the specific binding. Examples
of such scaffolds are the immunoglobulin beta-barrel and the four-helix bundle, which
are well-known in the art. This shuffling can be used to create scaffold-like proteins
with various combinations of mutated sequences for binding.

In vitro Shuffling

The equivalents of some standard genetic matings may also be performed by
shuffling in vitro. For example, a "molecular backcross" can be performed by
repeatedly mixing the hybrid's nucleic acid with the wild-type nucleic acid while
selecting for the mutations of interest. As in traditional breeding, this approach can be
used to combine phenotypes from different sources into a background of choice. It is
useful, for example, for the removal of neutral mutations that affect unselected
characteristics (e.g., immunogenicity). Thus it can be useful to determine which
mutations in a protein are involved in the enhanced biological activity and which are
not, an advantage which cannot be achieved by error-prone mutagenesis or cassette
mutagenesis methods.

Large, functional genes can be assembled correctly from a mixture of small
random polynucleotides. This reaction may be of use for the reassembly of genes
from the highly fragmented DNA of fossils. In addition, random nucleic acid
fragments from fossils may be combined with polynucleotides from similar genes
from related species.

It is also contemplated that the method of the invention can be used for the in
vitro amplification of a whole genome from a single cell as is needed for a variety of
research and diagnostic applications. DNA amplification by PCR typically includes
sequences of about 40 kb. Amplification of a whole genome such as that of E. coli
(5,000 kb) by PCR would require about 250 primers yielding 125 forty-kb
polynucleotides. On the other hand, random production of polynucleotides of the
genome with sexual PCR cycles, followed by gel purification of small
polynucleotides will provide a multitude of possible primers. Use of this mix of
random small polynucleotides as primers in a PCR reaction alone or with the whole
genome as the template should result in an inverse chain reaction with the theoretical
endpoint of a single concatamer containing many copies of the genome.
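
The primer count quoted above follows from simple arithmetic, which may be checked with the following Python sketch. It assumes non-overlapping amplicons and one primer pair per amplicon; the function and constant names are hypothetical and serve only this example.

```python
# Back-of-the-envelope check of the amplicon and primer counts given in the text.
GENOME_KB = 5000     # approximate E. coli genome size stated in the text
AMPLICON_KB = 40     # approximate PCR product size stated in the text


def conventional_pcr_requirements(genome_kb: int, amplicon_kb: int) -> tuple:
    """Return (amplicons, primers) needed to tile a genome by conventional PCR.

    Assumes non-overlapping amplicons and two primers per amplicon.
    """
    amplicons = -(-genome_kb // amplicon_kb)   # ceiling division
    primers = 2 * amplicons
    return amplicons, primers


amplicons, primers = conventional_pcr_requirements(GENOME_KB, AMPLICON_KB)
print(f"{amplicons} amplicons of {AMPLICON_KB} kb, {primers} primers")
# prints: 125 amplicons of 40 kb, 250 primers
```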

A 100-fold amplification in the copy number and an average polynucleotide
size of greater than 50 kb may be obtained when only random polynucleotides are
used. It is thought that the larger concatamer is generated by overlap of many smaller
polynucleotides. The quality of specific PCR products obtained using synthetic
primers will be indistinguishable from the product obtained from unamplified DNA.
It is expected that this approach will be useful for the mapping of genomes.

The polynucleotide to be shuffled can be produced as random or non-random
polynucleotides, at the discretion of the practitioner. Moreover, the invention
provides a method of shuffling that is applicable to a wide range of polynucleotide
sizes and types, including the step of generating polynucleotide monomers to be used
as building blocks in the reassembly of a larger polynucleotide. For example, the
building blocks can be fragments of genes or they can be comprised of entire genes or
gene pathways, or any combination thereof.

In vivo Shuffling 

In an embodiment of in vivo shuffling, a mixed population of a specific
nucleic acid sequence is introduced into bacterial or eukaryotic cells under conditions
such that at least two different nucleic acid sequences are present in each host cell.
The polynucleotides can be introduced into the host cells by a variety of different
methods. The host cells can be transformed with the smaller polynucleotides using
methods known in the art, for example treatment with calcium chloride. If the
polynucleotides are inserted into a phage genome, the host cell can be transfected with
the recombinant phage genome having the specific nucleic acid sequences.

Alternatively, the nucleic acid sequences can be introduced into the host cell using 
electroporation, transfection, lipofection, biolistics, conjugation, and the like. 

In general, in this embodiment, specific nucleic acid sequences will be present
in vectors which are capable of stably replicating the sequence in the host cell. In
addition, it is contemplated that the vectors will encode a marker gene such that host
cells having the vector can be selected. This ensures that the mutated specific nucleic
acid sequence can be recovered after introduction into the host cell. However, it is
contemplated that the entire mixed population of the specific nucleic acid sequences
need not be present on a vector sequence. Rather only a sufficient number of
sequences need be cloned into vectors to ensure that after introduction of the
polynucleotides into the host cells each host cell contains one vector having at least
one specific nucleic acid sequence present therein. It is also contemplated that rather
than having a subset of the population of the specific nucleic acid sequences cloned
into vectors, this subset may be already stably integrated into the host cell.

It has been found that when two polynucleotides which have regions of
identity are inserted into the host cells, homologous recombination occurs between the
two polynucleotides. Such recombination between the two mutated specific nucleic
acid sequences will result in the production of double or triple hybrids in some
situations.

It has also been found that the frequency of recombination is increased if some
of the mutated specific nucleic acid sequences are present on linear nucleic acid
molecules. Therefore, in one embodiment, some of the specific nucleic acid
sequences are present on linear polynucleotides.

After transformation, the host cell transformants are placed under selection to
identify those host cell transformants which contain mutated specific nucleic acid
sequences having the qualities desired. For example, if increased resistance to a
particular drug is desired then the transformed host cells may be subjected to
increased concentrations of the particular drug and those transformants producing
mutated proteins able to confer increased drug resistance will be selected. If the
enhanced ability of a particular protein to bind to a receptor is desired, then
expression of the protein can be induced from the transformants and the resulting
protein assayed in a ligand binding assay by methods known in the art to identify that
subset of the mutated population which shows enhanced binding to the ligand.
Alternatively, the protein can be expressed in another system to ensure proper
processing.

Once a subset of the first recombined specific nucleic acid sequences
(daughter sequences) having the desired characteristics is identified, it is then
subjected to a second round of recombination. In the second cycle of recombination, the
recombined specific nucleic acid sequences may be mixed with the original mutated
specific nucleic acid sequences (parent sequences) and the cycle repeated as described
above. In this way a set of second recombined specific nucleic acid sequences can
be identified which have enhanced characteristics or encode proteins having
enhanced properties. This cycle can be repeated a number of times as desired.

It is also contemplated that in the second or subsequent recombination cycle, a
backcross can be performed. A molecular backcross can be performed by mixing the
desired specific nucleic acid sequences with a large number of the wild-type
sequence, such that at least one wild-type nucleic acid sequence and a mutated nucleic
acid sequence are present in the same host cell after transformation. Recombination
with the wild-type specific nucleic acid sequence will eliminate those neutral
mutations that may affect unselected characteristics such as immunogenicity but not
the selected characteristics.

In another embodiment of the invention, it is contemplated that during the first
round a subset of specific nucleic acid sequences can be generated as smaller
polynucleotides by slowing or halting their PCR amplification prior to introduction
into the host cell. The size of the polynucleotides must be large enough to contain
some regions of identity with the other sequences so as to homologously recombine
with the other sequences. The size of the polynucleotides will range from 0.03 kb to
100 kb, more preferably from 0.2 kb to 10 kb. It is also contemplated that in
subsequent rounds, all of the specific nucleic acid sequences other than the sequences
selected from the previous round may be utilized to generate PCR polynucleotides
prior to introduction into the host cells.

The shorter polynucleotide sequences can be single-stranded or 
double-stranded. The reaction conditions suitable for separating the strands of nucleic 
acid are well known in the art. 

The steps of this process can be repeated indefinitely, being limited only by 
the number of possible hybrids which can be achieved. 

Therefore, the initial pool or population of mutated template nucleic acid is
cloned into a vector capable of replicating in a bacterium such as E. coli. The particular
vector is not essential, so long as it is capable of autonomous replication in E. coli. In
one embodiment, the vector is designed to allow the expression and production of
any protein encoded by the mutated specific nucleic acid linked to the vector. It is
also preferred that the vector contain a gene encoding a selectable marker.

The population of vectors containing the pool of mutated nucleic acid
sequences is introduced into the E. coli host cells. The vector nucleic acid sequences
may be introduced by transformation, transfection or infection in the case of phage.
The concentration of vectors used to transform the bacteria is such that a number of
vectors is introduced into each cell. Once present in the cell, the efficiency of
homologous recombination is such that homologous recombination occurs between
the various vectors. This results in the generation of hybrids (daughters) having a
combination of mutations which differ from the original parent mutated sequences.
The host cells are then clonally replicated and selected for the marker gene present on
the vector. Only those cells having a plasmid will grow under the selection. The host
cells which contain a vector are then tested for the presence of favorable mutations.

Once a particular daughter mutated nucleic acid sequence has been identified 
which confers the desired characteristics, the nucleic acid is isolated either already 
linked to the vector or separated from the vector. This nucleic acid is then mixed with 
the first or parent population of nucleic acids and the cycle is repeated. 

The parent mutated specific nucleic acid population, either as polynucleotides
or cloned into the same vector, is introduced into the host cells already containing the
daughter nucleic acids. Recombination is allowed to occur in the cells and the next
generation of recombinants, or granddaughters, are selected by the methods described
above. This cycle can be repeated a number of times until the nucleic acid or peptide
having the desired characteristics is obtained. It is contemplated that in subsequent
cycles, the population of mutated sequences which are added to the hybrids may come
from the parental hybrids or any subsequent generation.

In an alternative embodiment, the invention provides a method of conducting a
"molecular" backcross of the obtained recombinant specific nucleic acid in order to
eliminate any neutral mutations. Neutral mutations are those mutations which do not
confer onto the nucleic acid or peptide the desired properties. Such mutations may
however confer on the nucleic acid or peptide undesirable characteristics.
Accordingly, it is desirable to eliminate such neutral mutations. The method of the
invention provides a means of doing so.

In this embodiment, after the hybrid nucleic acid, having the desired
characteristics, is obtained by the methods of the embodiments, the nucleic acid, the
vector having the nucleic acid or the host cell containing the vector and nucleic acid is
isolated.

The nucleic acid or vector is then introduced into the host cell with a large
excess of the wild-type nucleic acid. The nucleic acid of the hybrid and the nucleic
acid of the wild-type sequence are allowed to recombine. The resulting recombinants
are placed under the same selection as the hybrid nucleic acid. Only those
recombinants which retained the desired characteristics will be selected. Any silent
mutations which do not provide the desired characteristics will be lost through
recombination with the wild-type DNA. This cycle can be repeated a number of
times until all of the silent mutations are eliminated.
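
The logic of the molecular backcross described above may be sketched in Python as follows. This is a simplified model and not the disclosed procedure: it assumes that each mutation of the hybrid is inherited independently (either from the hybrid or from the wild type), it enumerates every resulting recombinant rather than sampling them, and the mutation labels and function names are hypothetical.

```python
from itertools import product


def backcross_recombinants(hybrid_mutations):
    """Yield every subset of the hybrid's mutations as a frozenset.

    Each subset models one recombinant in which the remaining sites
    have been replaced by wild-type sequence.
    """
    sites = sorted(hybrid_mutations)
    for keep in product([False, True], repeat=len(sites)):
        yield frozenset(s for s, k in zip(sites, keep) if k)


def select(recombinants, required):
    """Keep recombinants that still carry every mutation needed for the selected trait."""
    return [r for r in recombinants if required <= r]


# Hypothetical hybrid: one mutation confers the selected trait, two are neutral.
hybrid = {"A12V", "G45D", "L88P"}
required = {"G45D"}

survivors = select(backcross_recombinants(hybrid), required)
cleanest = min(survivors, key=len)     # survivor carrying the fewest mutations
print(len(survivors), "recombinants pass selection")
print(sorted(cleanest))                # ['G45D']
```

In this model the neutral mutations are absent from at least one recombinant that still passes selection, which is the outcome the repeated backcross cycles are intended to enrich.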

Exonuclease-Mediated Reassembly

In another embodiment, the invention provides for a method for shuffling,
assembling, reassembling, recombining, and/or concatenating at least two
polynucleotides to form a progeny polynucleotide (e.g., a chimeric progeny
polynucleotide that can be expressed to produce a polypeptide or a gene pathway). In
a particular embodiment, a double stranded polynucleotide (e.g., two single stranded
sequences hybridized to each other as hybridization partners) is treated with an
exonuclease to liberate nucleotides from one of the two strands, leaving the remaining
strand free of its original partner so that, if desired, the remaining strand may be used
to achieve hybridization to another partner.

In a particular aspect, a double stranded polynucleotide end (that may be part
of - or connected to - a polynucleotide or a nonpolynucleotide sequence) is subjected
to a source of exonuclease activity. Serviceable sources of exonuclease activity may
be an enzyme with 3' exonuclease activity, an enzyme with 5' exonuclease activity,
an enzyme with both 3' exonuclease activity and 5' exonuclease activity, and any
combination thereof. An exonuclease can be used to liberate nucleotides from one or
both ends of a linear double stranded polynucleotide, and from one to all ends of a
branched polynucleotide having more than two ends.

By contrast, a non-enzymatic step may be used to shuffle, assemble,
reassemble, recombine, and/or concatenate polynucleotide building blocks that is
comprised of subjecting a working sample to denaturing (or "melting") conditions
(for example, by changing temperature, pH, and/or salinity conditions) so as to melt a
working set of double stranded polynucleotides into single polynucleotide strands.
For shuffling, it is desirable that the single polynucleotide strands participate to some
extent in annealment with different hybridization partners (i.e., and not merely revert
to exclusive reannealment between what were former partners before the denaturation
step). The presence of the former hybridization partners in the reaction vessel,
however, does not preclude, and may sometimes even favor, reannealment of a single
stranded polynucleotide with its former partner, to recreate an original double
stranded polynucleotide.

In contrast to this non-enzymatic shuffling step comprised of subjecting
double stranded polynucleotide building blocks to denaturation, followed by
annealment, the invention further provides an exonuclease-based approach requiring
no denaturation - rather, the avoidance of denaturing conditions and the maintenance
of double stranded polynucleotide substrates in annealed (i.e., non-denatured) state are
necessary conditions for the action of exonucleases (e.g., exonuclease III and red
alpha gene product). Additionally, in contrast, the generation of single stranded
polynucleotide sequences capable of hybridizing to other single stranded
polynucleotide sequences is the result of covalent cleavage - and hence sequence
destruction - in one of the hybridization partners. For example, an exonuclease III
enzyme may be used to enzymatically liberate 3' terminal nucleotides in one
hybridization strand (to achieve covalent hydrolysis in that polynucleotide strand);
and this favors hybridization of the remaining single strand to a new partner (since its
former partner was subjected to covalent cleavage).

It is particularly appreciated that enzymes can be discovered, optimized (e.g.,
engineered by directed evolution), or both discovered and optimized specifically for
the instantly disclosed approach that have more optimal rates and/or more highly
specific activities and/or greater lack of unwanted activities. In fact it is expected that
the invention may encourage the discovery and/or development of such designer
enzymes.

Furthermore, it is appreciated that one can protect the end of a double stranded
polynucleotide or render it susceptible to a desired enzymatic action of a serviceable
exonuclease as necessary. For example, a double stranded polynucleotide end having
a 3' overhang is not susceptible to the exonuclease action of exonuclease III.
However, it may be rendered susceptible to the exonuclease action of exonuclease III
by a variety of means; for example, it may be blunted by treatment with a polymerase,
cleaved to provide a blunt end or a 5' overhang, joined (ligated or hybridized) to
another double stranded polynucleotide to provide a blunt end or a 5' overhang,
hybridized to a single stranded polynucleotide to provide a blunt end or a 5' overhang,
or modified by any of a variety of means.

According to one aspect, an exonuclease may be allowed to act on one or on
both ends of a linear double stranded polynucleotide and proceed to completion, to
near completion, or to partial completion. When the exonuclease action is allowed to
go to completion, the result will be that the length of each 5' overhang will extend
far towards the middle region of the polynucleotide in the direction of what might be
considered a "rendezvous point" (which may be somewhere near the polynucleotide
midpoint). Ultimately, this results in the production of single stranded
polynucleotides (that can become dissociated) that are each about half the length of
the original double stranded polynucleotide.
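
The geometry of this digestion may be illustrated with the following Python sketch. It assumes a 3' exonuclease (such as exonuclease III) that removes the same number of nucleotides from the 3' end of each strand and that stops once no duplex remains; the coordinate convention and the function name are hypothetical and chosen only for this example.

```python
def digest_3prime(length: int, removed_per_end: int):
    """Model 3' exonuclease digestion of a blunt linear duplex.

    Coordinates are half-open intervals on the top strand, read 5' to 3'.
    The top strand is shortened from its 3' end (the right side); the
    bottom strand's 3' end lies at the left, so it is shortened there.
    Returns (top_span, bottom_span, duplex_span); duplex_span is None
    once the two strands no longer overlap.
    """
    # Digestion cannot proceed past the point at which the duplex is gone.
    k = min(removed_per_end, (length + 1) // 2)
    top = (0, length - k)
    bottom = (k, length)
    duplex = (k, length - k) if k < length - k else None
    return top, bottom, duplex


# Partial digestion: two 10-base 5' overhangs flank an 80-bp duplex.
print(digest_3prime(100, 10))   # ((0, 90), (10, 100), (10, 90))

# Digestion to completion: the strands meet at the midpoint and dissociate.
print(digest_3prime(100, 50))   # ((0, 50), (50, 100), None)
```

At completion each remaining strand spans half of the original coordinates, in agreement with the statement above that the single stranded products are each about half the length of the original duplex.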

Thus this exonuclease-mediated approach is serviceable for shuffling,
assembling and/or reassembling, recombining, and concatenating polynucleotide
building blocks, which polynucleotide building blocks can be up to ten bases long or
tens of bases long or hundreds of bases long or thousands of bases long or tens of
thousands of bases long or hundreds of thousands of bases long or millions of bases
long or even longer.

Substrates for an exonuclease may be generated by subjecting a double
stranded polynucleotide to fragmentation. Fragmentation may be achieved by
mechanical means (e.g., shearing, sonication, etc.), by enzymatic means (e.g., using
restriction enzymes), and by any combination thereof. Fragments of a larger
polynucleotide may also be generated by polymerase-mediated synthesis.

Additional examples of enzymes with exonuclease activity include red-alpha
and venom phosphodiesterases. Red alpha (redα) gene product (also referred to as
lambda exonuclease) is of bacteriophage λ origin. Red alpha gene product acts
processively from 5'-phosphorylated termini to liberate mononucleotides from duplex
DNA (Takahashi & Kobayashi, 1990). Venom phosphodiesterases (Laskowski,
1980) are capable of rapidly opening supercoiled DNA.

Non-stochastic Ligation Reassembly

In one aspect, the present invention provides a non-stochastic method termed
synthetic ligation reassembly (SLR), that is somewhat related to stochastic shuffling,
save that the nucleic acid building blocks are not shuffled or concatenated or
chimerized randomly, but rather are assembled non-stochastically.

The SLR method does not depend on the presence of a high level of homology
between polynucleotides to be shuffled. The invention can be used to
non-stochastically generate libraries (or sets) of progeny molecules comprised of over
10^100 different chimeras. Conceivably, SLR can even be used to generate libraries
comprised of over 10^1000 different progeny chimeras.

Thus, in one aspect, the invention provides a non-stochastic method of
producing a set of finalized chimeric nucleic acid molecules having an overall
assembly order that is chosen by design, which method is comprised of the steps of
generating by design a plurality of specific nucleic acid building blocks having
serviceable mutually compatible ligatable ends, and assembling these nucleic acid
building blocks, such that a designed overall assembly order is achieved.

The mutually compatible ligatable ends of the nucleic acid building blocks to
be assembled are considered to be "serviceable" for this type of ordered assembly if
they enable the building blocks to be coupled in predetermined orders. Thus, in one
aspect, the overall assembly order in which the nucleic acid building blocks can be
coupled is specified by the design of the ligatable ends and, if more than one assembly
step is to be used, then the overall assembly order in which the nucleic acid building
blocks can be coupled is also specified by the sequential order of the assembly step(s).
In one embodiment of the invention, the annealed building pieces are treated with an
enzyme, such as a ligase (e.g., T4 DNA ligase) to achieve covalent bonding of the
building pieces.
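
How the design of the ligatable ends can fix the assembly order may be sketched in Python as follows. The sketch is an abstraction made for illustration: each end is represented by a label, and two ends are treated as mutually compatible when their labels are equal, whereas actual overhangs pair by sequence complementarity. The class, function and label names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Block:
    name: str
    left: Optional[str]    # label of the upstream end; None marks the first block
    right: Optional[str]   # label of the downstream end; None marks the last block


def assembly_order(blocks):
    """Return block names in the single order permitted by their end labels."""
    by_left = {}
    for b in blocks:
        if b.left in by_left:
            raise ValueError(f"end {b.left!r} is shared by two blocks; order is ambiguous")
        by_left[b.left] = b
    if None not in by_left:
        raise ValueError("no starting block (a block without an upstream end)")
    order = [by_left[None]]
    while order[-1].right is not None:
        nxt = by_left.get(order[-1].right)
        if nxt is None:
            raise ValueError(f"no partner for end {order[-1].right!r}")
        if nxt in order:
            raise ValueError("end labels form a cycle")
        order.append(nxt)
    return [b.name for b in order]


# The blocks are supplied in arbitrary order; the end labels alone fix the result.
blocks = [Block("C", "o2", None), Block("A", None, "o1"), Block("B", "o1", "o2")]
print(assembly_order(blocks))   # ['A', 'B', 'C']
```

Requiring each label to occur on only one upstream end is one way of expressing, in this toy model, that the ends are serviceable for ordered assembly.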

In another embodiment, the design of nucleic acid building blocks is
obtained upon analysis of the sequences of a set of progenitor nucleic acid templates
that serve as a basis for producing a progeny set of finalized chimeric nucleic acid
molecules. These progenitor nucleic acid templates thus serve as a source of
sequence information that aids in the design of the nucleic acid building blocks that
are to be mutagenized, i.e., chimerized or shuffled.

In one exemplification, the invention provides for the chimerization of a
family of related genes and their encoded family of related products. In a particular
exemplification, the encoded products are enzymes. As a representative list of
families of enzymes which may be mutagenized in accordance with the aspects of the
present invention, there may be mentioned the following enzymes and their
functions: Lipase/Esterase, Protease, Glycosidase/Glycosyl transferase,
Phosphatase/Kinase, Mono/Dioxygenase, Haloperoxidase, Lignin
peroxidase/Diarylpropane peroxidase, Epoxide hydrolase, Nitrile hydratase/nitrilase,
Transaminase, Amidase/Acylase. These exemplifications, while illustrating certain
specific aspects of the invention, do not portray the limitations or circumscribe the
scope of the disclosed invention.

Thus according to one aspect of the invention, the sequences of a plurality of
progenitor nucleic acid templates identified using the methods of the invention are
aligned in order to select one or more demarcation points, which demarcation points
can be located at an area of homology. The demarcation points can be used to
delineate the boundaries of nucleic acid building blocks to be generated. Thus, the
demarcation points identified and selected in the progenitor molecules serve as
potential chimerization points in the assembly of the progeny molecules.

Typically a serviceable demarcation point is an area of homology (comprised
of at least one homologous nucleotide base) shared by at least two progenitor
templates, but the demarcation point can be an area of homology that is shared by at
least half of the progenitor templates, at least two thirds of the progenitor templates, at
least three fourths of the progenitor templates, and preferably by almost all of the
progenitor templates. Even more preferably still a serviceable demarcation point is an
area of homology that is shared by all of the progenitor templates.
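
The selection of candidate demarcation points from aligned progenitor templates may be sketched in Python as follows. The sketch assumes that the templates have already been aligned to equal length (with "-" marking gaps), considers single-base areas of homology only, and uses a hypothetical function name and toy sequences; the `min_fraction` parameter corresponds to the fractions of templates recited above.

```python
from collections import Counter


def demarcation_points(aligned, min_fraction=1.0):
    """Return alignment columns whose most common base is shared by at
    least two templates and by at least `min_fraction` of all templates."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("templates must be pre-aligned to equal length")
    n = len(aligned)
    points = []
    for col in range(len(aligned[0])):
        counts = Counter(s[col] for s in aligned if s[col] != "-")
        if not counts:
            continue                      # column contains only gaps
        shared = counts.most_common(1)[0][1]
        if shared >= 2 and shared / n >= min_fraction:
            points.append(col)
    return points


templates = ["ATGCCA", "ATGACA", "TTGCGA"]         # toy aligned progenitors
print(demarcation_points(templates))                # [1, 2, 5]
print(demarcation_points(templates, 0.5))           # [0, 1, 2, 3, 4, 5]
```

With the default setting only columns shared by all templates are reported, which corresponds to the most preferred case described above.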

In a preferred embodiment, the ligation reassembly process is performed
exhaustively in order to generate an exhaustive library. In other words, all possible
ordered combinations of the nucleic acid building blocks are represented in the set of
finalized chimeric nucleic acid molecules. At the same time, the assembly order (i.e.
the order of assembly of each building block in the 5' to 3' sequence of each finalized
chimeric nucleic acid) in each combination is by design (or non-stochastic). Because
of the non-stochastic nature of the invention, the possibility of unwanted side products
is greatly reduced.
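
An exhaustive library of this kind may be illustrated by the following Python sketch, in which each position in the 5' to 3' assembly order has several alternative building blocks and every ordered combination is generated exactly once. The function names and the toy building blocks are assumptions made for the example.

```python
from itertools import product
from math import prod


def exhaustive_library(positions):
    """Return every chimera obtained by choosing one building block per
    position; `positions` is listed in 5' to 3' assembly order."""
    return ["".join(combo) for combo in product(*positions)]


def library_size(positions):
    """Number of distinct ordered combinations, without enumerating them."""
    return prod(len(alternatives) for alternatives in positions)


positions = [
    ["AAA", "AAT"],           # alternatives for the first block
    ["CCC", "CCG", "CCT"],    # alternatives for the second block
    ["GGG", "GGA"],           # alternatives for the third block
]
library = exhaustive_library(positions)
print(len(library), library[0])    # 12 AAACCCGGG

# Hypothetical scale-up: 100 positions with 10 alternatives each.
print(library_size([["x"] * 10] * 100) == 10**100)   # True
```

The size of the library is the product of the number of alternatives at each position, which is how a modest number of building blocks per position can lead to very large library sizes.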

In another preferred embodiment, the invention provides that the ligation
reassembly process is performed systematically, for example in order to generate a
systematically compartmentalized library, with compartments that can be screened
systematically, e.g., one by one. In other words the invention provides that, through
the selective and judicious use of specific nucleic acid building blocks, coupled with
the selective and judicious use of sequentially stepped assembly reactions, an
experimental design can be achieved where specific sets of progeny products are
made in each of several reaction vessels. This allows a systematic examination and
screening procedure to be performed. Thus, it allows a potentially very large number
of progeny molecules to be examined systematically in smaller groups.
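
One possible experimental design of this kind may be sketched in Python as follows. The sketch assumes, for illustration only, that the library is compartmentalized by holding one building block position fixed in each reaction vessel while the remaining positions are varied exhaustively; the function name and the toy building blocks are hypothetical.

```python
from itertools import product


def compartmentalize(positions, split_on=0):
    """Map each alternative at position `split_on` to the progeny made in
    the vessel in which that alternative is the only one supplied."""
    vessels = {}
    for alternative in positions[split_on]:
        fixed = list(positions)
        fixed[split_on] = [alternative]
        vessels[alternative] = ["".join(combo) for combo in product(*fixed)]
    return vessels


positions = [
    ["AAA", "AAT"],
    ["CCC", "CCG", "CCT"],
    ["GGG", "GGA"],
]
for block, progeny in compartmentalize(positions).items():
    print(f"vessel with first block {block}: {len(progeny)} progeny")
```

Each vessel then holds a smaller, known subset of the progeny, and the vessels together cover the full set of ordered combinations without overlap.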

Because of its ability to perform chimerizations in a manner that is highly
flexible yet exhaustive and systematic as well, particularly when there is a low level
of homology among the progenitor molecules, the instant invention provides for the
generation of a library (or set) comprised of a large number of progeny molecules.

Because of the non-stochastic nature of the instant ligation reassembly invention, the
progeny molecules generated preferably comprise a library of finalized chimeric
nucleic acid molecules having an overall assembly order that is chosen by design. In
a particular embodiment, such a generated library is comprised of greater than 10^3
to greater than 10^1000 different progeny molecular species.

In one aspect, a set of finalized chimeric nucleic acid molecules, produced as
described, is comprised of a polynucleotide encoding a polypeptide. According to one
embodiment, this polynucleotide is a gene, which may be a man-made gene.
According to another embodiment, this polynucleotide is a gene pathway, which may
be a man-made gene pathway. The invention provides that one or more man-made
genes generated by the invention may be incorporated into a man-made gene
pathway, such as a pathway operable in a eukaryotic organism (including a plant).

In another exemplification, the synthetic nature of the step in which the
building blocks are generated allows the design and introduction of nucleotides (e.g.,
one or more nucleotides, which may be, for example, codons or introns or regulatory
sequences) that can later be optionally removed in an in vitro process (e.g., by
mutagenesis) or in an in vivo process (e.g., by utilizing the gene splicing ability of a
host organism). It is appreciated that in many instances the introduction of these
nucleotides may also be desirable for many other reasons in addition to the potential
benefit of creating a serviceable demarcation point.

Thus, according to another embodiment, the invention provides that a nucleic
acid building block can be used to introduce an intron. Thus, the invention provides
that functional introns may be introduced into a man-made gene of the invention. The
invention also provides that functional introns may be introduced into a man-made
gene pathway of the invention. Accordingly, the invention provides for the
generation of a chimeric polynucleotide that is a man-made gene containing one (or
more) artificially introduced intron(s).

Accordingly, the invention also provides for the generation of a chimeric 
polynucleotide that is a man-made gene pathway containing one (or more) artificially 
introduced intron(s). Preferably, the artificially introduced intron(s) are functional in 
one or more host cells for gene splicing much in the way that naturally-occurring 
introns serve functionally in gene splicing. The invention provides a process of
producing man-made intron-containing polynucleotides to be introduced into host 
organisms for recombination and/or splicing. 

A man-made gene produced using the invention can also serve as a substrate
for recombination with another nucleic acid. Likewise, a man-made gene pathway 
produced using the invention can also serve as a substrate for recombination with 
another nucleic acid. In a preferred instance, the recombination is facilitated by, or
occurs at, areas of homology between the man-made intron-containing gene and a 
nucleic acid which serves as a recombination partner. In a particularly preferred
instance, the recombination partner may also be a nucleic acid generated by the
invention, including a man-made gene or a man-made gene pathway. Recombination
may be facilitated by or may occur at areas of homology that exist at the one (or
more) artificially introduced intron(s) in the man-made gene. 

The synthetic ligation reassembly method of the invention utilizes a plurality 
of nucleic acid building blocks, each of which preferably has two ligatable ends. The 
two ligatable ends on each nucleic acid building block may be two blunt ends (i.e.,
each having an overhang of zero nucleotides), or preferably one blunt end and one 
overhang, or more preferably still two overhangs. 

A serviceable overhang for this purpose may be a 3' overhang or a 5' 
overhang. Thus, a nucleic acid building block may have a 3' overhang or
alternatively a 5' overhang or alternatively two 3' overhangs or alternatively two 5'
overhangs. The overall order in which the nucleic acid building blocks are assembled 
to form a finalized chimeric nucleic acid molecule is determined by purposeful 
experimental design and is not random. 
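
The way designed overhangs can fix the assembly order may be illustrated with a minimal sketch. It is an illustration only, not the method of the invention: the `Block` type, the `assemble` function, and the example sequences are hypothetical, each overhang is written as its top-strand sequence, and two blocks are treated as ligatable when the right overhang of one equals the left overhang of the next.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    name: str
    left: str   # top-strand sequence of the left overhang ("" means a blunt end)
    core: str   # double-stranded portion, top strand
    right: str  # top-strand sequence of the right overhang ("" means a blunt end)

def assemble(blocks):
    """Chain blocks by matching overhangs; reject designs whose order is ambiguous."""
    by_left = {}
    for b in blocks:
        if b.left in by_left:
            raise ValueError(f"overhang {b.left!r} is not unique; order would be ambiguous")
        by_left[b.left] = b
    current = by_left.pop("")  # the first block is the one with a blunt left end
    order, seq = [current.name], current.core
    while current.right:
        nxt = by_left.pop(current.right)  # KeyError if no compatible partner exists
        seq += current.right + nxt.core
        order.append(nxt.name)
        current = nxt
    return order, seq

# Hypothetical blocks, supplied out of order; the overhangs alone determine the result.
blocks = [
    Block("C", "GGTA", "TTT", ""),
    Block("A", "", "ATG", "CTAC"),
    Block("B", "CTAC", "CCC", "GGTA"),
]
order, seq = assemble(blocks)
print(order, seq)
```

Because every junction overhang in this hypothetical design is unique, only one order of assembly is possible, which mirrors the statement that the order is determined by design and is not random.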

According to one preferred embodiment, a nucleic acid building block is 
generated by chemical synthesis of two single-stranded nucleic acids (also referred to 
as single-stranded oligos) and contacting them so as to allow them to anneal to form a 
double-stranded nucleic acid building block. 

A double-stranded nucleic acid building block can be of variable size. The 
sizes of these building blocks can be small or large. Preferred sizes for building blocks
range from 1 base pair (not including any overhangs) to 100,000 base pairs (not
including any overhangs). Other preferred size ranges are also provided, which have
lower limits of from 1 bp to 10,000 bp (including every integer value in between), and
upper limits of from 2 bp to 100,000 bp (including every integer value in between).

Many methods exist by which a double-stranded nucleic acid building block 
can be generated that is serviceable for the invention; and these are known in the art 
and can be readily performed by the skilled artisan. 

According to one embodiment, a double-stranded nucleic acid building block
is generated by first generating two single stranded nucleic acids and allowing them to
anneal to form a double-stranded nucleic acid building block. The two strands of a
double-stranded nucleic acid building block may be complementary at every
nucleotide apart from any that form an overhang; thus containing no mismatches,
apart from any overhang(s). According to another embodiment, the two strands of a
double-stranded nucleic acid building block are complementary at fewer than every
nucleotide apart from any that form an overhang. Thus, according to this
embodiment, a double-stranded nucleic acid building block can be used to introduce
codon degeneracy. Preferably the codon degeneracy is introduced using the
site-saturation mutagenesis described herein, using one or more N,N,G/T cassettes or
alternatively using one or more N,N,N cassettes.
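
The difference between an N,N,G/T cassette and an N,N,N cassette can be made concrete by enumerating the codons each one represents. The following is a minimal sketch under the assumption of the standard genetic code; the function names `cassette_codons` and `summarize` are hypothetical and are introduced only for illustration.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def cassette_codons(third_position_bases):
    """List the codons of a cassette with N at positions 1 and 2 and the given third bases."""
    return ["".join(c) for c in product("ACGT", "ACGT", third_position_bases)]

def summarize(codons):
    """Return (number of codons, number of distinct amino acids, list of stop codons)."""
    amino_acids = {CODON_TABLE[c] for c in codons if CODON_TABLE[c] != "*"}
    stops = [c for c in codons if CODON_TABLE[c] == "*"]
    return len(codons), len(amino_acids), stops

nng_t = summarize(cassette_codons("GT"))   # N,N,G/T cassette
nnn = summarize(cassette_codons("ACGT"))   # N,N,N cassette
print(nng_t)
print(nnn)
```

Under this assumption, the N,N,G/T cassette covers all 20 amino acids with 32 codons and a single stop codon, whereas the N,N,N cassette uses all 64 codons and includes three stop codons.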

The in vivo recombination method of the invention can be performed blindly
on a pool of unknown hybrids or alleles of a specific polynucleotide or sequence;
that is, it is not necessary to know the actual DNA or RNA sequence of the specific
polynucleotide.

The approach of using recombination within a mixed population of genes can 
be useful for the generation of any useful proteins, for example, interleukin I, 
antibodies, tPA and growth hormone. This approach may be used to generate proteins
having altered specificity or activity. The approach may also be useful for the 
generation of hybrid nucleic acid sequences, for example, promoter regions, introns, 
exons, enhancer sequences, 3' untranslated regions or 5' untranslated regions of
genes. Thus this approach may be used to generate genes having increased rates of 
expression. This approach may also be useful in the study of repetitive DNA 
sequences. Finally, this approach may be useful to mutate ribozymes or aptamers. 

End Selection

The invention provides a method for selecting a subset of polynucleotides 
from a starting set of polynucleotides, which method is based on the ability to 
discriminate one or more selectable features (or selection markers) present anywhere
in a working polynucleotide, so as to allow one to perform selection for (positive
selection) and/or against (negative selection) each selectable polynucleotide. In a 
preferred aspect, a method is provided termed end-selection, which method is based 
on the use of a selection marker located in part or entirely in a terminal region of a 
selectable polynucleotide, and such a selection marker may be termed an
"end-selection marker".

End-selection may be based on detection of naturally occurring sequences or 
on detection of sequences introduced experimentally (including by any mutagenesis
procedure, whether or not mentioned herein) or on both, even within the
same polynucleotide. An end-selection marker can be a structural selection marker or
a functional selection marker or both a structural and a functional selection marker. 
An end-selection marker may be comprised of a polynucleotide sequence or of a 
polypeptide sequence or of any chemical structure or of any biological or biochemical 
tag, including markers that can be selected using methods based on the detection of
radioactivity, of enzymatic activity, of fluorescence, of any optical feature, of a
magnetic property (e.g., using magnetic beads), of immunoreactivity, and of
hybridization. 

End-selection may be applied in combination with any method for performing 
mutagenesis. Such mutagenesis methods include, but are not limited to, methods
described herein (supra and infra). Such methods include, by way of non-limiting 
exemplification, any method that may be referred to herein or by others in the art by any
of the following terms: "saturation mutagenesis", "shuffling", "recombination",
"re-assembly", "error-prone PCR", "assembly PCR", "sexual PCR", "crossover PCR",
"oligonucleotide primer-directed mutagenesis", "recursive (and/or exponential)
ensemble mutagenesis (see Arkin and Youvan, 1992)", "cassette mutagenesis",
"in vivo mutagenesis", and "in vitro mutagenesis". Moreover, end-selection may be
performed on molecules produced by any mutagenesis and/or amplification method 
(see, e.g., Arnold, 1993; Caldwell and Joyce, 1992; Stemmer, 1994) following which 
method it is desirable to select for (including to screen for the presence of) desirable 
progeny molecules. 

In addition, end-selection may be applied to a polynucleotide apart from any
mutagenesis method. In one embodiment, end-selection, as provided herein, can be
used in order to facilitate a cloning step, such as a step of ligation to another
polynucleotide (including ligation to a vector). The invention thus provides for
end-selection as a serviceable means to facilitate library construction, selection and/or
enrichment for desirable polynucleotides, and cloning in general.

In another embodiment, end-selection can be based on (positive) selection
for a polynucleotide; alternatively end-selection can be based on (negative) selection
against a polynucleotide; and alternatively still, end-selection can be based on both
(positive) selection for, and on (negative) selection against, a polynucleotide.
End-selection, along with other methods of selection and/or screening, can be performed in
an iterative fashion, with any combination of like or unlike selection and/or screening
methods and serviceable mutagenesis methods, all of which can be performed in an
iterative fashion and in any order, combination, and permutation. It is also appreciated
that end-selection may also be used to select a polynucleotide that is circular (e.g., a
plasmid or any other circular vector or any other polynucleotide that is partly
circular), and/or branched, and/or modified or substituted with any chemical group or
moiety.

In one non-limiting aspect, end-selection of a linear polynucleotide is 
performed using a general approach based on the presence of at least one
end-selection marker located at or near a polynucleotide end or terminus (that can be
either a 5' end or a 3' end). In one particular non-limiting exemplification,
end-selection is based on selection for a specific sequence at or near a terminus such as,
but not limited to, a sequence recognized by an enzyme that recognizes a
polynucleotide sequence. An enzyme that recognizes and catalyzes a chemical
modification of a polynucleotide is referred to herein as a polynucleotide-acting 
enzyme. In a preferred embodiment, serviceable polynucleotide-acting enzymes are 
exemplified non-exclusively by enzymes with polynucleotide-cleaving activity, 
enzymes with polynucleotide-methylating activity, enzymes with
polynucleotide-ligating activity, and enzymes with a plurality of distinguishable enzymatic activities
(including non-exclusively, e.g., both polynucleotide-cleaving activity and 
polynucleotide-ligating activity). 

It is appreciated that relevant polynucleotide-acting enzymes include any 
enzymes identifiable by one skilled in the art (e.g., commercially available) or that
may be developed in the future, though currently unavailable, that are serviceable for 
generating a ligation compatible end, preferably a sticky end, in a polynucleotide. It 
may be preferable to use restriction sites that are not contained, or alternatively that 
are not expected to be contained, or alternatively that are unlikely to be contained 
(e.g., when sequence information regarding a working polynucleotide is incomplete)
internally in a polynucleotide to be subjected to end-selection. It is recognized that 
methods (e.g., mutagenesis methods) can be used to remove unwanted internal 
restriction sites. It is also appreciated that a partial digestion reaction (i.e., a digestion
reaction that proceeds to partial completion) can be used to achieve digestion at a
recognition site in a terminal region while sparing a susceptible restriction site that
occurs internally in a polynucleotide and that is recognized by the same enzyme. In
one aspect, partial digests are useful because it is appreciated that certain enzymes
show preferential cleavage of the same recognition sequence depending on the 
location and environment in which the recognition sequence occurs. 
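
The preference for recognition sites that occur in a terminal region but not internally can be checked computationally when sequence information is available. The following is a minimal sketch, not part of the disclosed method: the function names, the 20-nucleotide terminal window, and the example sequences are assumptions chosen for illustration, and the EcoRI site GAATTC is used only as a familiar example. Only the given strand is scanned, which suffices for a palindromic site.

```python
def classify_site(seq, site, terminal_window=20):
    """Split occurrences of `site` in `seq` into terminal and internal start positions."""
    positions = [i for i in range(len(seq) - len(site) + 1) if seq.startswith(site, i)]
    terminal, internal = [], []
    for p in positions:
        end = p + len(site)
        # A site is terminal if it lies wholly within the window at either end.
        if end <= terminal_window or p >= len(seq) - terminal_window:
            terminal.append(p)
        else:
            internal.append(p)
    return terminal, internal

def suitable_for_end_selection(seq, site, terminal_window=20):
    """True when the site is present in a terminal region and absent internally."""
    terminal, internal = classify_site(seq, site, terminal_window)
    return bool(terminal) and not internal

# Hypothetical working polynucleotides.
clean = "GAATTC" + "A" * 50
with_internal_site = "GAATTC" + "A" * 30 + "GAATTC" + "A" * 30
print(suitable_for_end_selection(clean, "GAATTC"))
print(suitable_for_end_selection(with_internal_site, "GAATTC"))
```

A polynucleotide flagged as unsuitable by such a check is a candidate for the alternatives described in the text, namely removal of the internal site by mutagenesis, partial digestion, or protection of the internal site.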

It is also appreciated that protection methods can be used to selectively protect
specified restriction sites (e.g., internal sites) against unwanted digestion by enzymes
that would otherwise cut a working polynucleotide in response to the presence of those
sites; and that such protection methods include modifications such as methylations
and base substitutions (e.g., U instead of T) that inhibit an unwanted enzyme activity.

In another embodiment of the invention, a serviceable end-selection marker is
a terminal sequence that is recognized by a polynucleotide-acting enzyme that
recognizes a specific polynucleotide sequence. In one aspect of the invention, 
serviceable polynucleotide-acting enzymes also include other enzymes in addition to 
classic type II restriction enzymes. According to this preferred aspect of the 
invention, serviceable polynucleotide-acting enzymes also include gyrases (e.g.,
topoisomerases), helicases, recombinases, relaxases, and any enzymes related thereto.

It is appreciated that end-selection can be used to distinguish and separate
parental template molecules (e.g., to be subjected to mutagenesis) from progeny molecules
(e.g., generated by mutagenesis). For example, a first set of primers, lacking a
topoisomerase I recognition site, can be used to modify the terminal regions of the parental
molecules (e.g., in polymerase-based amplification). A different second set of primers
(e.g., having a topoisomerase I recognition site) can then be used to generate mutated
progeny molecules (e.g., using any polynucleotide chimerization method, such as
interrupted synthesis or template-switching polymerase-based amplification;
or using saturation mutagenesis; or using any other method for introducing a
topoisomerase I recognition site into a mutagenized progeny molecule) from the amplified
template molecules. The use of topoisomerase I-based end-selection can then facilitate
not only discernment, but also selective topoisomerase I-based ligation of the desired progeny
molecules.

It is appreciated that an end-selection approach using topoisomerase-based
nicking and ligation has several advantages over previously available selection
methods. In sum, this approach allows one to achieve directional cloning (including
expression cloning).

Peptide Display Methods

The present method can be used to shuffle, by in vitro and/or in vivo
recombination by any of the disclosed methods, and in any combination,
polynucleotide sequences selected by peptide display methods, wherein an associated
polynucleotide encodes a displayed peptide which is screened for a phenotype (e.g.,
for affinity for a predetermined receptor (ligand)).

An increasingly important aspect of bio-pharmaceutical drug development and
molecular biology is the identification of peptide structures, including the primary
amino acid sequences, of peptides or peptidomimetics that interact with biological
macromolecules. One method of identifying peptides that possess a desired structure
or functional property, such as binding to a predetermined biological macromolecule
(e.g., a receptor), involves the screening of a large library of peptides for individual
library members which possess the desired structure or functional property conferred
by the amino acid sequence of the peptide.

In addition to direct chemical synthesis methods for generating peptide 
libraries, several recombinant DNA methods also have been reported. One type
involves the display of a peptide sequence, antibody, or other protein on the surface of
a bacteriophage particle or cell. Generally, in these methods each bacteriophage 
particle or cell serves as an individual library member displaying a single species of 
displayed peptide in addition to the natural bacteriophage or cell protein sequences. 
Each bacteriophage or cell contains the nucleotide sequence information encoding the
particular displayed peptide sequence; thus, the displayed peptide sequence can be
ascertained by nucleotide sequence determination of an isolated library member. 

A well-known peptide display method involves the presentation of a peptide 
sequence on the surface of a filamentous bacteriophage, typically as a fusion with a 
bacteriophage coat protein. The bacteriophage library can be incubated with an
immobilized, predetermined macromolecule or small molecule (e.g., a receptor) so
that bacteriophage particles which present a peptide sequence that binds to the
immobilized macromolecule can be differentially partitioned from those that do not
present peptide sequences that bind to the predetermined macromolecule. The 
bacteriophage particles (i.e., library members) which are bound to the immobilized 
macromolecule are then recovered and replicated to amplify the selected 
bacteriophage sub-population for a subsequent round of affinity enrichment and phage
replication. After several rounds of affinity enrichment and phage replication, the 
bacteriophage library members that are thus selected are isolated and the nucleotide 
sequence encoding the displayed peptide sequence is determined, thereby identifying 
the sequence(s) of peptides that bind to the predetermined macromolecule (e.g., 
receptor). Such methods are further described in PCT patent publications WO
91/17271, WO 91/18980, WO 91/19818 and WO 93/08278. 
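
Why several rounds of affinity enrichment and replication are typically used can be seen from a simple toy model. The sketch below is an assumption-laden illustration and not a description of any cited method: it supposes that each round retains a fixed fraction of binding phage and a much smaller fixed fraction of non-binding phage, and that replication restores the population size without changing the ratio. The function name and all numerical values are hypothetical.

```python
def binder_fraction(initial_fraction, retain_binder, retain_background, rounds):
    """Fraction of the library made up of binders after a number of enrichment rounds."""
    binders, others = initial_fraction, 1.0 - initial_fraction
    for _ in range(rounds):
        binders *= retain_binder       # binders recovered from the immobilized target
        others *= retain_background    # non-binders carried over nonspecifically
        total = binders + others
        # Replication restores the population size but keeps the ratio.
        binders, others = binders / total, others / total
    return binders

# Hypothetical values: one binder per million, 50% binder recovery,
# 0.1% nonspecific carry-over per round.
for n in range(4):
    print(n, binder_fraction(1e-6, 0.5, 0.001, n))
```

With these hypothetical values a single round leaves binders as a small minority, while three rounds make them the large majority, which is consistent with the use of repeated rounds before sequence determination.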

The present invention also provides random, pseudorandom, and defined 
sequence framework peptide libraries and methods for generating and screening those
libraries to identify useful compounds (e.g., peptides, including single-chain
antibodies) that bind to receptor molecules or epitopes of interest or gene products
that modify peptides or RNA in a desired fashion. The random, pseudorandom, and 
defined sequence framework peptides are produced from libraries of peptide library 
members that comprise displayed peptides or displayed single-chain antibodies
attached to a polynucleotide template from which the displayed peptide was
synthesized. The mode of attachment may vary according to the specific embodiment
of the invention selected, and can include encapsulation in a phage particle or 
incorporation in a cell. 

A significant advantage of the present invention is that no prior information
regarding an expected ligand structure is required to isolate peptide ligands or
antibodies of interest. The peptide identified can have biological activity, which is 
meant to include at least specific binding affinity for a selected receptor molecule and, 
in some instances, will further include the ability to block the binding of other
compounds, to stimulate or inhibit metabolic pathways, to act as a signal or
messenger, to stimulate or inhibit cellular activity, and the like. 

The invention also provides a method for shuffling a pool of polynucleotide
sequences identified by the methods of the invention and selected by affinity 
screening a library of polysomes displaying nascent peptides (including single-chain 
antibodies) for library members which bind to a predetermined receptor (e.g., a 
mammalian proteinaceous receptor such as, for example, a peptidergic hormone
receptor, a cell surface receptor, an intracellular protein which binds to other 
protein(s) to form intracellular protein complexes such as hetero-dimers and the like) 
or epitope (e.g., an immobilized protein, glycoprotein, oligosaccharide, and the like). 

Polynucleotide sequences selected in a first selection round (typically by
affinity selection for binding to a receptor (e.g., a ligand)) by any of these methods are
pooled and the pool(s) is/are shuffled by in vitro and/or in vivo recombination to 
produce a shuffled pool comprising a population of recombined selected 
polynucleotide sequences. The recombined selected polynucleotide sequences are
subjected to at least one subsequent selection round. The polynucleotide sequences
selected in the subsequent selection round(s) can be used directly, sequenced, and/or 
subjected to one or more additional rounds of shuffling and subsequent selection. 
Selected sequences can also be back-crossed with polynucleotide sequences encoding 
neutral sequences (i.e., having insubstantial functional effect on binding), such as for
example by back-crossing with a wild-type or naturally-occurring sequence
substantially identical to a selected sequence to produce native-like functional 
peptides, which may be less immunogenic. Generally, during back-crossing 
subsequent selection is applied to retain the property of binding to the predetermined 
receptor (ligand). 

Prior to or concomitant with the shuffling of selected sequences, the sequences 
can be mutagenized. In one embodiment, selected library members are cloned in a 
prokaryotic vector (e.g., plasmid, phagemid, or bacteriophage) wherein a collection of 
individual colonies (or plaques) representing discrete library members are produced. 
Individual selected library members can then be manipulated (e.g., by site-directed
mutagenesis, cassette mutagenesis, chemical mutagenesis, PCR mutagenesis, and the 
like) to generate a collection of library members representing a kernel of sequence
diversity based on the sequence of the selected library member. The sequence of an
individual selected library member or pool can be manipulated to incorporate random
mutation, pseudorandom mutation, defined kernel mutation (i.e., comprising variant
and invariant residue positions and/or comprising variant residue positions which can
comprise a residue selected from a defined subset of amino acid residues),
codon-based mutation, and the like, either segmentally or over the entire length of the
individual selected library member sequence. The mutagenized selected library 
members are then shuffled by in vitro and/or in vivo recombinatorial shuffling as 
disclosed herein. 

The invention also provides peptide libraries comprising a plurality of 
individual library members of the invention, wherein (1) each individual library 
member of said plurality comprises a sequence produced by shuffling of a pool of 
selected sequences, and (2) each individual library member comprises a variable 
peptide segment sequence or single-chain antibody segment sequence which is
distinct from the variable peptide segment sequences or single-chain antibody 
sequences of other individual library members in said plurality (although some library 
members may be present in more than one copy per library due to uneven 
amplification, stochastic probability, or the like). 

The invention also provides a product-by-process, wherein selected 
polynucleotide sequences having (or encoding a peptide having) a predetermined 
binding specificity are formed by the process of: (1) screening a displayed peptide or 
displayed single-chain antibody library against a predetermined receptor (e.g., ligand)
or epitope (e.g., antigen macromolecule) and identifying and/or enriching library
members which bind to the predetermined receptor or epitope to produce a pool of 
selected library members, (2) shuffling by recombination the selected library
members (or amplified or cloned copies thereof) which bind the predetermined
epitope and have been thereby isolated and/or enriched from the library to generate a
shuffled library, and (3) screening the shuffled library against the predetermined
receptor (e.g., ligand) or epitope (e.g., antigen macromolecule) and identifying and/or
enriching shuffled library members which bind to the predetermined receptor or
epitope to produce a pool of selected shuffled library members. 

Antibody Display and Screening Methods

The present method can be used to shuffle, by in vitro and/or in vivo
recombination by any of the disclosed methods, and in any combination,
polynucleotide sequences selected by antibody display methods, wherein an 
associated polynucleotide encodes a displayed antibody which is screened for a 
phenotype (e.g., for affinity for binding a predetermined antigen (ligand)). 

Various molecular genetic approaches have been devised to capture the vast 
immunological repertoire represented by the extremely large number of distinct 
variable regions which can be present in immunoglobulin chains. The 
naturally-occurring germ line immunoglobulin heavy chain locus is composed of
separate tandem arrays of variable segment genes located upstream of a tandem array
of diversity segment genes, which are themselves located upstream of a tandem array
of joining (J) region genes, which are located upstream of the constant region genes.
During B lymphocyte development, V-D-J rearrangement occurs wherein a heavy
chain variable region gene (VH) is formed by rearrangement to form a fused D-J
segment followed by rearrangement with a V segment to form a V-D-J joined product
gene which, if productively rearranged, encodes a functional variable region (VH) of 
a heavy chain. Similarly, light chain loci rearrange one of several V segments with 
one of several J segments to form a gene encoding the variable region (VL) of a light 
chain. 

The vast repertoire of variable regions possible in immunoglobulins derives in 
part from the numerous combinatorial possibilities of joining V and J segments (and,
in the case of heavy chain loci, D segments) during rearrangement in B cell 
development. Additional sequence diversity in the heavy chain variable regions arises 
from non-uniform rearrangements of the D segments during V-D-J joining and from
N region addition. Further, antigen-selection of specific B cell clones selects for 
higher affinity variants having non-germline mutations in one or both of the heavy 
and light chain variable regions; a phenomenon referred to as "affinity maturation" or
"affinity sharpening". Typically, these "affinity sharpening" mutations cluster in 
specific areas of the variable region, most commonly in the 
complementarity-determining regions (CDRs). 

In order to overcome many of the limitations in producing and identifying 
high-affinity immunoglobulins through antigen-stimulated B cell development (i.e., 
immunization), various prokaryotic expression systems have been developed that can 
be manipulated to produce combinatorial antibody libraries which may be screened 
for high-affinity antibodies to specific antigens. Recent advances in the expression of
antibodies in Escherichia coli and bacteriophage systems (see "alternative peptide
display methods", infra) have raised the possibility that virtually any specificity can 
be obtained by either cloning antibody genes from characterized hybridomas or by de 
novo selection using antibody gene libraries (e.g., from Ig cDNA). 

Combinatorial libraries of antibodies have been generated in bacteriophage 
lambda expression systems which may be screened as bacteriophage plaques or as 
colonies of lysogens (Huse et al., 1989; Caton and Koprowski, 1990; Mullinax et al.,
1990; Persson et al., 1991). Various embodiments of bacteriophage antibody display
libraries and lambda phage expression libraries have been described (Kang et al.,
1991; Clackson et al., 1991; McCafferty et al., 1990; Burton et al., 1991;
Hoogenboom et al., 1991; Chang et al., 1991; Breitling et al., 1991; Marks et al.,
1991, p. 581; Barbas et al., 1992; Hawkins and Winter, 1992; Marks et al., 1992, p.
779; Marks et al., 1992, p. 16007; Lowman et al., 1991; and Lerner et al., 1992; all
incorporated herein by reference). Typically, a bacteriophage antibody display library
is screened with a receptor (e.g., polypeptide, carbohydrate, glycoprotein, nucleic 
acid) that is immobilized (e.g., by covalent linkage to a chromatography resin to 
enrich for reactive phage by affinity chromatography) and/or labeled (e.g., to screen 
plaque or colony lifts). 

One particularly advantageous approach has been the use of so-called
single-chain fragment variable (scFv) libraries (Marks et al., 1992, p. 779; Winter and
Milstein, 1991; Clackson et al., 1991; Marks et al., 1991, p. 581; Chaudhary et al.,
1990; Chiswell et al., 1992; McCafferty et al., 1990; and Huston et al., 1988).
Various embodiments of scFv libraries displayed on bacteriophage coat proteins have
been described.

Beginning in 1988, single-chain analogues of Fv fragments and their fusion 
proteins have been reliably generated by antibody engineering methods. The first step 
generally involves obtaining the genes encoding VH and VL domains with desired 
binding properties; these V genes may be isolated from a specific hybridoma cell line,
selected from a combinatorial V-gene library, or made by V gene synthesis. The
single-chain Fv is formed by connecting the component V genes with an 
oligonucleotide that encodes an appropriately designed linker peptide, such as 
(Gly-Gly-Gly-Gly-Ser) or equivalent linker peptide(s). The linker bridges the 
C-terminus of the first V region and N-terminus of the second, ordered as either
VH-linker-VL or VL-linker-VH. In principle, the scFv binding site can faithfully
replicate both the affinity and specificity of its parent antibody combining site. 

Thus, scFv fragments are comprised of VH and VL domains linked into a
single polypeptide chain by a flexible linker peptide. After the scFv genes are
assembled, they are cloned into a phagemid and expressed at the tip of the M13 phage
(or similar filamentous bacteriophage) as fusion proteins with the bacteriophage pIII
(gene 3) coat protein. Enriching for phage expressing an antibody of interest is
accomplished by panning the recombinant phage displaying a population of scFv for
binding to a predetermined epitope (e.g., target antigen, receptor).

The linked polynucleotide of a library member provides the basis for 
replication of the library member after a screening or selection procedure, and also 
provides the basis for the determination, by nucleotide sequencing, of the identity of 
the displayed peptide sequence or VH and VL amino acid sequence. The displayed 
peptide(s) or single-chain antibody (e.g., scFv) and/or its VH and VL domains or their
CDRs can be cloned and expressed in a suitable expression system. Often 
polynucleotides encoding the isolated VH and VL domains will be ligated to 
polynucleotides encoding constant regions (CH and CL) to form polynucleotides
encoding complete antibodies (e.g., chimeric or fully-human), antibody fragments, 
and the like. Often polynucleotides encoding the isolated CDRs will be grafted into 
polynucleotides encoding a suitable variable region framework (and optionally 
constant regions) to form polynucleotides encoding complete antibodies (e.g.,
humanized or fully-human), antibody fragments, and the like. Antibodies can be used
to isolate preparative quantities of the antigen by immunoaffinity chromatography. 
Various other uses of such antibodies are to diagnose and/or stage disease (e.g., 
neoplasia) and for therapeutic application to treat disease, such as for example: 
neoplasia, autoimmune disease, AIDS, cardiovascular disease, infections, and the like.

Various methods have been reported for increasing the combinatorial diversity 
of a scfV library to broaden the repertoire of binding species (idiotype spectrum) The 
use of PCR has permitted the variable regions to be rapidly cloned either from a 

15 specific hybridoma source or as a gene library from non-immunized cells, affording 
combinatorial diversity in the assortment of VH and VL cassettes which can be 
combined. Furthermore, the VH and VL cassettes can themselves be diversified, such 
as by random, pseudorandom, or directed mutagenesis. Typically, VH and VL 
cassettes are diversified in or near the complementarity-determining regions (CDRs), 
often the third CDR, CDR3. Enzymatic inverse PCR mutagenesis has been shown to 
be a simple and reliable method for constructing relatively large libraries of scFv 
site-directed hybrids (Stemmer et al., 1993), as have error-prone PCR and chemical 
mutagenesis (Deng et al., 1994). Riechmann (Riechmann et al., 1993) showed 
semi-rational design of an antibody scFv fragment using site-directed randomization by 
degenerate oligonucleotide PCR and subsequent phage display of the resultant scFv 
hybrids. Barbas (Barbas et al., 1992) attempted to circumvent the problem of limited 
repertoire sizes resulting from using biased variable region sequences by randomizing 
the sequence in a synthetic CDR region of a human tetanus toxoid-binding Fab. 

CDR randomization has the potential to create approximately 1 x 10^20 CDRs 
for the heavy chain CDR3 alone, and a roughly similar number of variants of the 
heavy chain CDR1 and CDR2, and light chain CDR1-3 variants. Taken individually 
or together, the combination possibilities of CDR randomization of heavy and/or light 
chains require generating a prohibitive number of bacteriophage clones to produce a 
clone library representing all possible combinations, the vast majority of which will 
be non-binding. Generation of such large numbers of primary transformants is not 
feasible with current transformation technology and bacteriophage display systems. 
For example, Barbas (Barbas et al., 1992) only generated 5 x 10^7 transformants, 
which represents only a tiny fraction of the potential diversity of a library of 
thoroughly randomized CDRs. 
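
By way of illustration and not limitation, the gap between the two figures quoted above can be made explicit with a short calculation. The sketch below uses only the approximate diversity (1 x 10^20) and the reported library size (5 x 10^7) from the preceding paragraph; the function name is hypothetical, and the calculation assumes, optimistically, that every transformant carries a distinct variant.

```python
from math import log10

# Figures quoted in the text above.
potential_cdr3_variants = 1e20  # approximate heavy chain CDR3 diversity
primary_transformants = 5e7     # library size reported by Barbas et al., 1992


def coverage_fraction(library_size: float, diversity: float) -> float:
    """Upper bound on the fraction of distinct variants a library can hold.

    Assumes every library member is unique, which overstates real coverage.
    """
    if library_size < 0 or diversity <= 0:
        raise ValueError("library_size must be >= 0 and diversity must be > 0")
    return min(1.0, library_size / diversity)


fraction = coverage_fraction(primary_transformants, potential_cdr3_variants)
shortfall = log10(potential_cdr3_variants / primary_transformants)

print(f"fraction of CDR3 diversity sampled: {fraction:.1e}")
print(f"orders of magnitude short of full coverage: {shortfall:.1f}")
```

Under these assumptions the library samples on the order of 5 x 10^-13 of the potential heavy chain CDR3 diversity, roughly twelve orders of magnitude short of full coverage.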

Despite these substantial limitations, bacteriophage display of scFv has 
already yielded a variety of useful antibodies and antibody fusion proteins. A 
bispecific single chain antibody has been shown to mediate efficient tumor cell lysis 
(Gruber et al., 1994). Intracellular expression of an anti-Rev scFv has been shown to 
inhibit HIV-1 virus replication in vitro (Duan et al., 1994), and intracellular 
expression of an anti-p21ras scFv has been shown to inhibit meiotic maturation of 
Xenopus oocytes (Biocca et al., 1993). Recombinant scFv which can be used to 
diagnose HIV infection have also been reported, demonstrating the diagnostic utility 
of scFv (Lilley et al., 1994). Fusion proteins wherein an scFv is linked to a second 
polypeptide, such as a toxin or fibrinolytic activator protein, have also been reported 
(Holvost et al., 1992; Nicholls et al., 1993). 

If it were possible to generate scFv libraries having broader antibody diversity 
and overcoming many of the limitations of conventional CDR mutagenesis and 
randomization methods, which can cover only a very tiny fraction of the potential 
sequence combinations, the number and quality of scFv antibodies suitable for 
therapeutic and diagnostic use could be vastly improved. To address this, the in vitro 
and in vivo shuffling methods of the invention are used to recombine CDRs which 
have been obtained (typically via PCR amplification or cloning) from nucleic acids 
obtained from selected displayed antibodies. Such displayed antibodies can be 
displayed on cells, on bacteriophage particles, on polysomes, or any suitable antibody 
display system wherein the antibody is associated with its encoding nucleic acid(s). 
In a variation, the CDRs are initially obtained from mRNA (or cDNA) from 
antibody-producing cells (e.g., plasma cells/splenocytes from an immunized wild-type 
mouse, a human, or a transgenic mouse capable of making a human antibody as in 
WO 92/03918, WO 93/12227, and WO 94/25585), including hybridomas derived 
therefrom. 

Polynucleotide sequences selected in a first selection round (typically by 
affinity selection for displayed antibody binding to an antigen (e.g., a ligand)) by any 
of these methods are pooled and the pool(s) is/are shuffled by in vitro and/or in vivo 
recombination, especially shuffling of CDRs (typically shuffling heavy chain CDRs 
with other heavy chain CDRs and light chain CDRs with other light chain CDRs) to 
produce a shuffled pool comprising a population of recombined selected 
polynucleotide sequences. The recombined selected polynucleotide sequences are 
expressed in a selection format as a displayed antibody and subjected to at least one 
subsequent selection round. The polynucleotide sequences selected in the subsequent 
selection round(s) can be used directly, sequenced, and/or subjected to one or more 
additional rounds of shuffling and subsequent selection until an antibody of the 
desired binding affinity is obtained. Selected sequences can also be back-crossed 
with polynucleotide sequences encoding neutral antibody framework sequences (i.e., 
having insubstantial functional effect on antigen binding), such as for example by 
back-crossing with a human variable region framework to produce human-like 
sequence antibodies. Generally, during back-crossing subsequent selection is applied 
to retain the property of binding to the predetermined antigen. 

Alternatively, or in combination with the noted variations, the valency of the 
target epitope may be varied to control the average binding affinity of selected scFv 
library members. The target epitope can be bound to a surface or substrate at varying 
densities, such as by including a competitor epitope, by dilution, or by other methods 
known to those in the art. A high density (valency) of predetermined epitope can be 
used to enrich for scFv library members which have relatively low affinity, whereas a 
low density (valency) can preferentially enrich for higher affinity scFv library 
members. 

For generating diverse variable segments, a collection of synthetic 
oligonucleotides encoding random, pseudorandom, or a defined sequence kernel set of 
peptide sequences can be inserted by ligation into a predetermined site (e.g., a CDR). 
Similarly, the sequence diversity of one or more CDRs of the single-chain antibody 
cassette(s) can be expanded by mutating the CDR(s) with site-directed mutagenesis, 
CDR-replacement, and the like. The resultant DNA molecules can be propagated in a 
host for cloning and amplification prior to shuffling, or can be used directly (i.e., may 
avoid loss of diversity which may occur upon propagation in a host cell) and the 
selected library members subsequently shuffled. 

Displayed peptide/polynucleotide complexes (library members) which encode 
a variable segment peptide sequence of interest or a single-chain antibody of interest 
are selected from the library by an affinity enrichment technique. This is 
accomplished by means of an immobilized macromolecule or epitope specific for the 
peptide sequence of interest, such as a receptor, other macromolecule, or other epitope 
species. Repeating the affinity selection procedure provides an enrichment of library 
members encoding the desired sequences, which may then be isolated for pooling and 
shuffling, for sequencing, and/or for further propagation and affinity enrichment. 

The library members without the desired specificity are removed by washing. 
The degree and stringency of washing required will be determined for each peptide 
sequence or single-chain antibody of interest and the immobilized predetermined 
macromolecule or epitope. A certain degree of control can be exerted over the 
binding characteristics of the nascent peptide/DNA complexes recovered by adjusting 
the conditions of the binding incubation and the subsequent washing. The 
temperature, pH, ionic strength, divalent cation concentration, and the volume and 
duration of the washing will select for nascent peptide/DNA complexes within 
particular ranges of affinity for the immobilized macromolecule. Selection based on 
slow dissociation rate, which is usually predictive of high affinity, is often the most 
practical route. This may be done either by continued incubation in the presence of a 
saturating amount of free predetermined macromolecule, or by increasing the volume, 
number, and length of the washes. In each case, the rebinding of dissociated nascent 
peptide/DNA or peptide/RNA complex is prevented, and with increasing time, 
nascent peptide/DNA or peptide/RNA complexes of higher and higher affinity are 
recovered. 
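
The effect of wash duration on off-rate selection can be sketched numerically. The example below is a minimal illustration that assumes simple first-order dissociation with rebinding fully prevented, as described above; the dissociation rate constants and wash times are hypothetical values chosen only to show the trend and are not taken from this specification.

```python
import math


def fraction_retained(k_off_per_s: float, wash_seconds: float) -> float:
    """Fraction of complexes still bound after washing.

    Assumes first-order dissociation and no rebinding of released complexes.
    """
    if k_off_per_s < 0 or wash_seconds < 0:
        raise ValueError("rate constant and wash time must be non-negative")
    return math.exp(-k_off_per_s * wash_seconds)


def enrichment(k_off_slow: float, k_off_fast: float, wash_seconds: float) -> float:
    """Ratio of retained slow-dissociating to retained fast-dissociating complexes."""
    return (fraction_retained(k_off_slow, wash_seconds)
            / fraction_retained(k_off_fast, wash_seconds))


# Hypothetical dissociation rate constants (per second) for three library members.
hypothetical_k_off = {"low_affinity": 1e-2, "mid_affinity": 1e-3, "high_affinity": 1e-4}

for wash_seconds in (60, 600, 3600):
    retained = {name: fraction_retained(k, wash_seconds)
                for name, k in hypothetical_k_off.items()}
    summary = ", ".join(f"{name}={value:.3g}" for name, value in retained.items())
    print(f"wash {wash_seconds:>4} s: {summary}")
```

In this simplified model, lengthening the wash reduces recovery of every complex but reduces recovery of fast-dissociating complexes far more, which is consistent with the statement that complexes of higher and higher affinity are recovered with increasing time.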

Additional modifications of the binding and washing procedures may be 
applied to find peptides with special characteristics. The affinities of some peptides 
are dependent on ionic strength or cation concentration. This is a useful characteristic 
for peptides that will be used in affinity purification of various proteins when gentle 
conditions for removing the protein from the peptides are required. 

One variation involves the use of multiple binding targets (multiple epitope 
species, multiple receptor species), such that an scFv library can be simultaneously 
screened for a multiplicity of scFv which have different binding specificities. Given 
that the size of an scFv library often limits the diversity of potential scFv sequences, it is 
typically desirable to use scFv libraries of as large a size as possible. The time and 
economic considerations of generating a number of very large polysome scFv-display 
libraries can become prohibitive. To avoid this substantial problem, multiple 
predetermined epitope species (receptor species) can be concomitantly screened in a 
single library, or sequential screening against a number of epitope species can be 
used. In one variation, multiple target epitope species, each encoded on a separate 
bead (or subset of beads), can be mixed and incubated with a polysome-display scFv 
library under suitable binding conditions. The collection of beads, comprising 
multiple epitope species, can then be used to isolate, by affinity selection, scFv library 
members. Generally, subsequent affinity screening rounds can include the same 
mixture of beads, subsets thereof, or beads containing only one or two individual 
epitope species. This approach affords efficient screening, and is compatible with 
laboratory automation, batch processing, and high throughput screening methods. 

A variety of techniques can be used in the present invention to diversify a 
peptide library or single-chain antibody library, or to diversify, prior to or 
concomitant with shuffling, around variable segment peptides found in early rounds 
of panning to have sufficient binding activity to the predetermined macromolecule or 
epitope. In one approach, the positive selected peptide/polynucleotide complexes 
(those identified in an early round of affinity enrichment) are sequenced to determine 
the identity of the active peptides. Oligonucleotides are then synthesized based on 
these active peptide sequences, employing a low level of all bases incorporated at 
each step to produce slight variations of the primary oligonucleotide sequences. This 
mixture of (slightly) degenerate oligonucleotides is then cloned into the variable 
segment sequences at the appropriate locations. This method produces systematic, 
controlled variations of the starting peptide sequences, which can then be shuffled. It 
requires, however, that individual positive nascent peptide/polynucleotide complexes 
be sequenced before mutagenesis, and thus is useful for expanding the diversity of 
small numbers of recovered complexes and selecting variants having higher binding 
affinity and/or higher binding specificity. In a variation, mutagenic PCR 
amplification of positive selected peptide/polynucleotide complexes (especially of the 
variable region sequences) is performed, the amplification products are shuffled in vitro 
and/or in vivo, and one or more additional rounds of screening are done prior to 
sequencing. The same general approach can be employed with single-chain 
antibodies in order to expand the diversity and enhance the binding 
affinity/specificity, typically by diversifying CDRs or adjacent framework regions 
prior to or concomitant with shuffling. If desired, shuffling reactions can be spiked 
with mutagenic oligonucleotides capable of in vitro recombination with the selected 
library members. Thus, mixtures of synthetic oligonucleotides and 
PCR produced polynucleotides (synthesized by error-prone or high-fidelity methods) 
can be added to the in vitro shuffling mix and be incorporated into resulting shuffled 
library members (shufflants). 

The invention of shuffling enables the generation of a vast library of 
CDR-variant single-chain antibodies. One way to generate such antibodies is to insert 
synthetic CDRs into the single-chain antibody and/or to randomize CDRs prior to or 
concomitant with shuffling. The sequences of the synthetic CDR cassettes are 
selected by referring to known sequence data of human CDRs and are selected in the 
discretion of the practitioner according to the following guidelines: synthetic CDRs 
will have at least 40 percent positional sequence identity to known CDR sequences, 
and preferably will have at least 50 to 70 percent positional sequence identity to 
known CDR sequences. For example, a collection of synthetic CDR sequences can 
be generated by synthesizing a collection of oligonucleotide sequences on the basis of 
naturally-occurring human CDR sequences listed in Kabat (Kabat et al., 1991); the 
pool(s) of synthetic CDR sequences are calculated to encode CDR peptide sequences 
having at least 40 percent sequence identity to at least one known naturally-occurring 
human CDR sequence. Alternatively, a collection of naturally-occurring CDR 
sequences may be compared to generate consensus sequences so that amino acids 
used at a residue position frequently (i.e., in at least 5 percent of known CDR 
sequences) are incorporated into the synthetic CDRs at the corresponding position(s). 
Typically, several (e.g., 3 to about 50) known CDR sequences are compared and 
observed natural sequence variations between the known CDRs are tabulated, and a 
collection of oligonucleotides encoding CDR peptide sequences encompassing all or 
most permutations of the observed natural sequence variations is synthesized. For 
example but not for limitation, if a collection of human VH CDR sequences have 
carboxy-terminal amino acids which are either Tyr, Val, Phe, or Asp, then the pool(s) 
of synthetic CDR oligonucleotide sequences are designed to allow the 
carboxy-terminal CDR residue to be any of these amino acids. In some embodiments, 
residues other than those which naturally occur at a residue position in the collection 
of CDR sequences are incorporated: conservative amino acid substitutions are 
frequently incorporated and up to 5 residue positions may be varied to incorporate 
non-conservative amino acid substitutions as compared to known naturally-occurring 
CDR sequences. Such CDR sequences can be used in primary library members (prior 
to first round screening) and/or can be used to spike in vitro shuffling reactions of 
selected library member sequences. Construction of such pools of defined and/or 
degenerate sequences will be readily accomplished by those of ordinary skill in the 
art. 
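
The tabulation of observed residues and the enumeration of their permutations described above can be sketched as follows. This is an illustrative example only: the four five-residue sequences are hypothetical and are not taken from Kabat, all function names are assumptions introduced here, and the sketch treats equal-length peptide sequences rather than the encoding oligonucleotides. The carboxy-terminal position is allowed to be Tyr, Val, Phe, or Asp (Y, V, F, D), mirroring the example in the text.

```python
from collections import Counter
from itertools import product


def positional_identity(a: str, b: str) -> float:
    """Percent of positions at which two equal-length sequences share a residue."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def residues_by_position(known: list[str], min_fraction: float = 0.05) -> list[list[str]]:
    """Tabulate residues seen at each position in at least min_fraction of sequences."""
    if not known or any(len(s) != len(known[0]) for s in known):
        raise ValueError("known sequences must be non-empty and of equal length")
    allowed = []
    for column in zip(*known):
        counts = Counter(column)
        allowed.append(sorted(r for r, n in counts.items()
                              if n / len(known) >= min_fraction))
    return allowed


def enumerate_synthetic_cdrs(known: list[str], min_fraction: float = 0.05) -> list[str]:
    """Enumerate every permutation of the tabulated per-position residues."""
    return ["".join(combo)
            for combo in product(*residues_by_position(known, min_fraction))]


# Hypothetical CDR peptide sequences (single-letter amino acid code).
known_cdrs = ["GYTFY", "GYSFV", "GFTFF", "GYTFD"]

synthetic = enumerate_synthetic_cdrs(known_cdrs)
novel = [s for s in synthetic if s not in known_cdrs]
lowest_best_identity = min(max(positional_identity(s, k) for k in known_cdrs)
                           for s in synthetic)

print(f"{len(synthetic)} synthetic CDRs, {len(novel)} not among the known sequences")
print(f"lowest best-match positional identity: {lowest_best_identity:.0f} percent")
```

In this toy case every enumerated sequence retains well above 40 percent positional identity to at least one known sequence, while most members of the pool are not themselves among the known sequences.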

The collection of synthetic CDR sequences comprises at least one member 
that is not known to be a naturally-occurring CDR sequence. It is within the 
discretion of the practitioner to include or not include a portion of random or 
pseudorandom sequence corresponding to N region addition in the heavy chain CDR; 
the N region sequence ranges from 1 nucleotide to about 4 nucleotides occurring at 
V-D and D-J junctions. A collection of synthetic heavy chain CDR sequences 
comprises at least about 100 unique CDR sequences, typically at least about 1,000 
unique CDR sequences, preferably at least about 10,000 unique CDR sequences, 
frequently more than 50,000 unique CDR sequences; however, usually not more than 
about 1 x 10^6 unique CDR sequences are included in the collection, although 
occasionally 1 x 10^7 to 1 x 10^8 unique CDR sequences are present, especially if 
conservative amino acid substitutions are permitted at positions where the 
conservative amino acid substituent is not present or is rare (i.e., less than 0.1 percent) 
in that position in naturally-occurring human CDRs. In general, the number of 
unique CDR sequences included in a library should not exceed the expected number 
of primary transformants in the library by more than a factor of 10. Such single-chain 
antibodies generally bind with an affinity of about at least 1 x 10^7 M^-1, preferably with an affinity of 
about at least 5 x 10^7 M^-1, more preferably with an affinity of at least 1 x 10^8 M^-1 to 
1 x 10^9 M^-1 or more, sometimes up to 1 x 10^10 M^-1 or more. Frequently, the 
predetermined antigen is a human protein, such as for example a human cell surface 
antigen (e.g., CD4, CD8, IL-2 receptor, EGF receptor, PDGF receptor), other human 
biological macromolecule (e.g., thrombomodulin, protein C, carbohydrate antigen, 
sialyl Lewis antigen, L-selectin), or nonhuman disease associated macromolecule (e.g., 
bacterial LPS, virion capsid protein or envelope glycoprotein) and the like. 

High affinity single-chain antibodies of the desired specificity can be 
engineered and expressed in a variety of systems. For example, scFv have been 
produced in plants (Firek et al., 1993) and can be readily made in prokaryotic systems 
(Owens and Young, 1994; Johnson and Bird, 1991). Furthermore, the single-chain 
antibodies can be used as a basis for constructing whole antibodies or various 
fragments thereof (Kettleborough et al., 1994). The variable region encoding 
sequence may be isolated (e.g., by PCR amplification or subcloning) and spliced to a 
sequence encoding a desired human constant region to encode a human sequence 
antibody more suitable for human therapeutic uses where immunogenicity is 
preferably minimized. The polynucleotide(s) having the resultant fully human 
encoding sequence(s) can be expressed in a host cell (e.g., from an expression vector 
in a mammalian cell) and purified for pharmaceutical formulation. 

Once expressed, the antibodies, individual mutated immunoglobulin chains, 
mutated antibody fragments, and other immunoglobulin polypeptides of the invention 
can be purified according to standard procedures of the art, including ammonium 
sulfate precipitation, fraction column chromatography, gel electrophoresis and the like 
(see, generally, Scopes, 1982). Once purified, partially or to homogeneity as desired, 
the polypeptides may then be used therapeutically or in developing and performing 
assay procedures, immunofluorescent stainings, and the like (see, generally, Lefkovits 
and Pernis, 1979 and 1981; Lefkovits, 1997). 

The antibodies generated by the method of the present invention can be used 
for diagnosis and therapy. By way of illustration and not limitation, they can be used 
to treat cancer, autoimmune diseases, or viral infections. For treatment of cancer, the 
antibodies will typically bind to an antigen expressed preferentially on cancer cells, 
such as erbB-2, CEA, CD33, and many other antigens and binding members well 
known to those skilled in the art. 

Two-Hybrid Based Screening Assays 

Shuffling can also be used to recombinatorially diversify a pool of selected 
library members obtained by screening a two-hybrid screening system to identify 
library members which bind a predetermined polypeptide sequence. The selected 
library members are pooled and shuffled by in vitro and/or in vivo recombination. 
The shuffled pool can then be screened in a yeast two-hybrid system to select library 
members which bind said predetermined polypeptide sequence (e.g., an SH2 
domain) or which bind an alternate predetermined polypeptide sequence (e.g., an SH2 
domain from another protein species). 

An approach to identifying polypeptide sequences which bind to a 
predetermined polypeptide sequence has been to use a so-called "two-hybrid" system 
wherein the predetermined polypeptide sequence is present in a fusion protein (Chien 
et al., 1991). This approach identifies protein-protein interactions in vivo through 
reconstitution of a transcriptional activator (Fields and Song, 1989), the yeast Gal4 
transcription protein. Typically, the method is based on the properties of the yeast 
Gal4 protein, which consists of separable domains responsible for DNA-binding and 
transcriptional activation. Polynucleotides encoding two hybrid proteins, one 
consisting of the yeast Gal4 DNA-binding domain fused to a polypeptide sequence of 
a known protein and the other consisting of the Gal4 activation domain fused to a 
polypeptide sequence of a second protein, are constructed and introduced into a yeast 
host cell. Intermolecular binding between the two fusion proteins reconstitutes the 
Gal4 DNA-binding domain with the Gal4 activation domain, which leads to the 
transcriptional activation of a reporter gene (e.g., lacZ, HIS3) which is operably linked 
to a Gal4 binding site. Typically, the two-hybrid method is used to identify novel 
polypeptide sequences which interact with a known protein (Silver and Hunt, 1993; 
Durfee et al., 1993; Yang et al., 1992; Luban et al., 1993; Hardy et al., 1992; Bartel et 
al., 1993; and Vojtek et al., 1993). However, variations of the two-hybrid method 
have been used to identify mutations of a known protein that affect its binding to a 
second known protein (Li and Fields, 1993; Lalo et al., 1993; Jackson et al., 1993; 
and Madura et al., 1993). Two-hybrid systems have also been used to identify 
interacting structural domains of two known proteins (Bardwell et al., 1993; 
Chakrabarty et al., 1992; Staudinger et al., 1993; and Milne and Weaver, 1993) or 
domains responsible for oligomerization of a single protein (Iwabuchi et al., 1993; 
Bogerd et al., 1993). Variations of two-hybrid systems have been used to study the in 
vivo activity of a proteolytic enzyme (Dasmahapatra et al., 1992). Alternatively, an E. 
coli/BCCP interactive screening system (Germino et al., 1993; Guarente, 1993) can 
be used to identify interacting protein sequences (i.e., protein sequences which 
heterodimerize or form higher order heteromultimers). Sequences selected by a 
two-hybrid system can be pooled and shuffled and introduced into a two-hybrid system for 
one or more subsequent rounds of screening to identify polypeptide sequences which 
bind to the hybrid containing the predetermined binding sequence. The sequences 
thus identified can be compared to identify consensus sequence(s) and consensus 
sequence kernels. 

One microgram samples of template DNA are obtained and treated with U.V. 
light to cause the formation of dimers, particularly pyrimidine dimers, including TT 
dimers. U.V. exposure is limited so that only a few photoproducts are generated per 
gene on the template DNA sample. Multiple samples are treated with U.V. light for 
varying periods of time to obtain template DNA samples with varying numbers of 
dimers from U.V. exposure. 

A random priming kit which utilizes a non-proofreading polymerase (for 
example, Prime-It II Random Primer Labeling kit by Stratagene Cloning Systems) is 
utilized to generate different size polynucleotides by priming at random sites on 
templates which are prepared by U.V. light (as described above) and extending along 
the templates. The priming protocols such as described in the Prime-It II Random 
Primer Labeling kit may be utilized to extend the primers. The dimers formed by 
U.V. exposure serve as a roadblock for the extension by the non-proofreading 
polymerase. Thus, a pool of random size polynucleotides is present after extension 
with the random primers is finished. 
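
The way randomly placed primers and U.V.-induced roadblocks together yield a range of product sizes can be illustrated with a simplified model. The sketch below is an assumption-laden illustration rather than a description of the kit protocol: it considers a single template strand, ignores primer length, assumes extension always halts at the first dimer downstream of the priming site, and uses hypothetical positions throughout.

```python
from bisect import bisect_right


def extension_lengths(template_length: int,
                      dimer_positions: list[int],
                      primer_starts: list[int]) -> list[int]:
    """Length of each extension product on one template strand.

    Each primer is extended from its start position until the first dimer
    located downstream of it, or to the end of the template if none remains.
    """
    blocks = sorted(dimer_positions)
    lengths = []
    for start in primer_starts:
        if not 0 <= start < template_length:
            raise ValueError("primer start must lie on the template")
        idx = bisect_right(blocks, start)  # first dimer strictly downstream
        stop = blocks[idx] if idx < len(blocks) else template_length
        lengths.append(stop - start)
    return lengths


# Hypothetical 1,000-nucleotide template with two dimers and four priming sites.
lengths = extension_lengths(1000, dimer_positions=[300, 750],
                            primer_starts=[0, 100, 350, 800])
print(lengths)
```

In this model, increasing the number of dimers per template (for example, by longer U.V. exposure) shortens the average product, which is one way to picture how varying exposure times give pools of differently sized polynucleotides.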

The invention is further directed to a method for generating a selected mutant 
polynucleotide sequence (or a population of selected polynucleotide sequences) 
typically in the form of amplified and/or cloned polynucleotides, whereby the selected 
polynucleotide sequence(s) possess at least one desired phenotypic characteristic 
(e.g., encodes a polypeptide, promotes transcription of linked polynucleotides, binds a 
protein, and the like) which can be selected for. One method for identifying hybrid 
polypeptides that possess a desired structure or functional property, such as binding to 
a predetermined biological macromolecule (e.g., a receptor), involves the screening of 
a large library of polypeptides for individual library members which possess the 
desired structure or functional property conferred by the amino acid sequence of the 
polypeptide. 

In one embodiment, the present invention provides a method for generating 
libraries of displayed polypeptides or displayed antibodies suitable for affinity 
interaction screening or phenotypic screening. The method comprises (1) obtaining a 
first plurality of selected library members comprising a displayed polypeptide or 
displayed antibody and an associated polynucleotide encoding said displayed 
polypeptide or displayed antibody, and obtaining said associated polynucleotides or 
copies thereof wherein said associated polynucleotides comprise a region of 
substantially identical sequences, optionally introducing mutations into said 
polynucleotides or copies, (2) pooling the polynucleotides or copies, (3) producing 
smaller or shorter polynucleotides by interrupting a random or particularized priming 
and synthesis process or an amplification process, and (4) performing amplification, 
preferably PCR amplification, and optionally mutagenesis to homologously 
recombine the newly synthesized polynucleotides. 

It is an object of the invention to provide a process for producing hybrid 
polynucleotides which express a useful hybrid polypeptide by a series of steps 
comprising: 

(a) producing polynucleotides by interrupting a polynucleotide 
amplification or synthesis process with a means for blocking or interrupting the 
amplification or synthesis process and thus providing a plurality of smaller or shorter 
polynucleotides due to the replication of the polynucleotide being in various stages of 
completion; 

(b) adding to the resultant population of single- or double-stranded 
polynucleotides one or more single- or double-stranded oligonucleotides, wherein said 
added oligonucleotides comprise an area of identity in an area of heterology to one or 
more of the single- or double-stranded polynucleotides of the population; 

(c) denaturing the resulting single- or double-stranded oligonucleotides to 
produce a mixture of single-stranded polynucleotides, optionally separating the 
shorter or smaller polynucleotides into pools of polynucleotides having various 
lengths and further optionally subjecting said polynucleotides to a PCR procedure to 
amplify one or more oligonucleotides comprised by at least one of said polynucleotide 
pools; 

(d) incubating a plurality of said polynucleotides or at least one pool of 
said polynucleotides with a polymerase under conditions which result in annealing of 
said single-stranded polynucleotides at regions of identity between the single-stranded 
polynucleotides and thus forming of a mutagenized double-stranded polynucleotide 
chain; 

(e) optionally repeating steps (c) and (d); 

(f) expressing at least one hybrid polypeptide from said polynucleotide 
chain, or chains; and 

(g) screening said at least one hybrid polypeptide for a useful activity. 

In a preferred aspect of the invention, the means for blocking or interrupting 
the amplification or synthesis process is by utilization of U.V. light, DNA adducts, or DNA 
binding proteins. 

In one embodiment of the invention, the DNA adducts, or polynucleotides 
comprising the DNA adducts, are removed from the polynucleotides or 
polynucleotide pool, such as by a process including heating the solution comprising 
the DNA fragments prior to further processing. 

Sequencing 



The clones enriched for a desired polynucleotide sequence, which are 
identified as described above, may be sequenced to identify the DNA sequence(s) 
present in the clone, which sequence information can be used to screen a database for 
similar sequences or functional characteristics. Thus, in accordance with the present 
invention it is possible to: (i) isolate and identify DNA having a sequence of interest 
(e.g., a sequence encoding an enzyme having a specified enzyme activity), (ii) 
associate the sequence with a known or unknown sequence in a database (e.g., a database 
sequence associated with an enzyme having an activity (including the amino acid 
sequence thereof)), and (iii) produce recombinant enzymes having such activity. 

Sequencing may be performed by high-throughput sequencing techniques. The 
exact method of sequencing is not a limiting factor of the invention. Any method useful 
in identifying the sequence of a particular cloned DNA sequence can be used. In 
general, sequencing is an adaptation of the natural process of DNA replication. 
Therefore, a template (e.g., the vector) and primer sequences are used. One general 
template preparation and sequencing protocol begins with automated picking of bacterial 
colonies, each of which contains a separate DNA clone which will function as a template 
for the sequencing reaction. The selected clones are placed into media, and grown 
overnight. The DNA templates are then purified from the cells and suspended in water. 
After DNA quantification, high-throughput sequencing is performed using sequencers, 
such as Applied Biosystems, Inc., Prism 377 DNA Sequencers. The resulting sequence 
data can then be used in additional methods, including to search a database or databases. 

Database Searches and Alignment Algorithms 

A number of source databases are available that contain either a nucleic acid 
sequence and/or a deduced amino acid sequence for use with the invention in identifying 
or determining the activity encoded by a particular polynucleotide sequence. All or a 
representative portion of the sequences (e.g., about 100 individual clones) to be tested 
are used to search a sequence database (e.g., GenBank, PFAM or ProDom), either 
simultaneously or individually. A number of different methods of performing such 
sequence searches are known in the art. The databases can be specific for a particular 
organism or a collection of organisms. For example, there are databases for C. 
elegans, Arabidopsis sp., M. genitalium, M. jannaschii, E. coli, H. influenzae, S. 
cerevisiae and others. The sequence data of the clone is then aligned to the sequences in 
the database or databases using algorithms designed to measure homology between two 
or more sequences. 

Such sequence alignment methods include, for example, BLAST (Altschul et al., 
1990), BLITZ (MPsrch) (Sturrock & Collins, 1993), and FASTA (Pearson & Lipman, 
1988). The probe sequence (e.g., the sequence data from the clone) can be any length, 
and will be recognized as homologous based upon a threshold homology value. The 
threshold value may be predetermined, although this is not required. The threshold 
value can be based upon the particular polynucleotide length. To align sequences a 
number of different procedures can be used. Typically, Smith-Waterman or 
Needleman-Wunsch algorithms are used. However, as discussed, faster procedures such as BLAST, 
FASTA, and PSI-BLAST can be used. 

For example, optimal alignment of sequences over a comparison window
may be conducted by the local homology algorithm of Smith (Smith and Waterman,
Adv Appl Math, 1981; Smith and Waterman, J Theor Biol, 1981; Smith and Waterman, J
Mol Biol, 1981; Smith et al., J Mol Evol, 1981), by the homology alignment algorithm of
Needleman (Needleman and Wunsch, 1970), by the search for similarity method of
Pearson (Pearson and Lipman, 1988), by computerized implementations of these
algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software
Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, WI, or the
Sequence Analysis Software Package of the Genetics Computer Group, University of
Wisconsin, Madison, WI), or by inspection, and the best alignment (i.e., resulting in the
highest percentage of homology over the comparison window) generated by the various
methods is selected. The similarity of the two sequences (i.e., the probe sequence and the
database sequence) can then be predicted.
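
By way of illustration only, the local homology approach of Smith and Waterman can be sketched in Python as follows. The scoring values (match, mismatch and gap) and the function name are assumptions chosen for the sketch and are not taken from the packages named above; a linear gap penalty is used for brevity.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not prescribed defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic-programming score matrix
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Hypothetical probe and database sequences sharing the segment ACGTACG.
    print(smith_waterman("TTTACGTACGTTT", "GGACGTACGGG"))
```

The quadratic cost of filling the matrix is the reason the faster heuristic procedures mentioned above are typically preferred for database-scale searches.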

Such software matches similar sequences by assigning degrees of homology to
various deletions, substitutions and other modifications. The terms "homology" and
"identity", in the context of two or more nucleic acid or polypeptide sequences, refer to
two or more sequences or subsequences that are the same or have a specified percentage
of amino acid residues or nucleotides that are the same when compared and aligned for
maximum correspondence over a comparison window or designated region, as measured
using any number of sequence comparison algorithms or by manual alignment and
visual inspection.

For sequence comparison, typically one sequence acts as a reference sequence, to 
which test sequences are compared. When using a sequence comparison algorithm, test 
and reference sequences are entered into a computer, subsequence coordinates are 
designated, if necessary, and sequence algorithm program parameters are designated.
Default program parameters can be used, or alternative parameters can be designated. 
The sequence comparison algorithm then calculates the percent sequence identities for 
the test sequences relative to the reference sequence, based on the program parameters. 
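
As a minimal sketch of the percent identity calculation, assuming the test and reference sequences have already been aligned to equal length with "-" marking gaps, the number of matched positions is divided by the window size and multiplied by 100. The function name and the gap character are assumptions made for this example.

```python
def percent_identity(aligned_ref, aligned_test, gap="-"):
    """Percent identity of two pre-aligned sequences over the whole window."""
    if len(aligned_ref) != len(aligned_test):
        raise ValueError("aligned sequences must have the same length")
    window = len(aligned_ref)
    if window == 0:
        raise ValueError("comparison window is empty")
    # A position counts as matched only when both residues are identical
    # and neither is a gap.
    matched = sum(
        1 for r, t in zip(aligned_ref, aligned_test) if r == t and r != gap
    )
    return 100.0 * matched / window


if __name__ == "__main__":
    # Hypothetical 10-position window with one substitution.
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
```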

A "comparison window", as used herein, includes reference to a segment of any 
one of the number of contiguous positions selected from the group consisting of from 20
to 600, usually about 50 to about 200, more usually about 100 to about 150, in which a
sequence may be compared to a reference sequence of the same number of contiguous 
positions after the two sequences are optimally aligned. 

Examples of useful algorithms are the BLAST and BLAST 2.0 algorithms, which
are described in Altschul et al., J. Mol. Biol. 215:403-410 (1990) and Altschul et
al., Nuc. Acids Res. 25:3389-3402 (1997), respectively. Software for performing BLAST
analyses is publicly available through the National Center for Biotechnology Information
(http://www.ncbi.nlm.nih.gov/). This algorithm involves first identifying high scoring
sequence pairs (HSPs) by identifying short words of length W in the query sequence,
which either match or satisfy some positive-valued threshold score T when aligned with
a word of the same length in a database sequence. T is referred to as the neighborhood
word score threshold (Altschul et al., supra). These initial neighborhood word hits act as
seeds for initiating searches to find longer HSPs containing them. The word hits are
extended in both directions along each sequence for as far as the cumulative alignment
score can be increased. Cumulative scores are calculated using, for nucleotide
sequences, the parameters M (reward score for a pair of matching residues; always >0)
and N (penalty score for mismatching residues; always <0).
The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of
the alignment. The BLASTN program (for nucleotide sequences) uses as defaults a
wordlength (W) of 11, an expectation (E) of 10, M=5, N=-4 and a comparison of both
strands.
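
The seed-and-extend idea can be sketched as follows. This is a simplified illustration and not the BLAST implementation: seeds are exact word matches of length W, extension is ungapped, and X is assumed here to be the amount by which the running score may fall below its best value before extension stops (the value 20 is a placeholder). W=11, M=5 and N=-4 follow the BLASTN defaults stated above; the function names are hypothetical.

```python
def find_seeds(query, subject, w=11):
    """Yield (q_pos, s_pos) for every exact word match of length w."""
    index = {}
    for s in range(len(subject) - w + 1):
        index.setdefault(subject[s:s + w], []).append(s)
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            yield q, s


def _extend_one_way(query, subject, q, s, step, start_score, m, n, x):
    """Extend in one direction; return (best score, residues kept)."""
    best = cur = start_score
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        cur += m if query[q] == subject[s] else n
        length += 1
        if cur > best:
            best, best_len = cur, length
        elif best - cur > x:  # assumed drop-off rule
            break
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q, s, w=11, m=5, n=-4, x=20):
    """Return (score, q_start, q_end) of the ungapped HSP grown from a seed."""
    score = m * w  # a seed is w matching residues
    score, right = _extend_one_way(query, subject, q + w, s + w, 1, score, m, n, x)
    score, left = _extend_one_way(query, subject, q - 1, s - 1, -1, score, m, n, x)
    return score, q - left, q + w + right


if __name__ == "__main__":
    core = "ACGTTGCAAGCTAGCT"  # hypothetical shared 16-nt segment
    query, subject = "GGGG" + core + "AAAA", "TTTT" + core + "CCCC"
    print({extend_seed(query, subject, q, s) for q, s in find_seeds(query, subject)})
```

Every seed inside the shared segment grows into the same HSP, which is why a practical implementation would also de-duplicate overlapping hits.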

The BLAST algorithm also performs a statistical analysis of the similarity
between two sequences (see, e.g., Karlin & Altschul, Proc. Natl. Acad. Sci. USA
90:5873 (1993)). One measure of similarity provided by the BLAST algorithm is the
smallest sum probability (P(N)), which provides an indication of the probability by
which a match between two nucleotide sequences would occur by chance. For example,
a nucleic acid is considered similar to a reference sequence if the smallest sum
probability in a comparison of the test nucleic acid to the reference nucleic acid is less
than about 0.2, more preferably less than about 0.01, and most preferably less than about
0.001.

Sequence homology means that two polynucleotide sequences are homologous
(i.e., on a nucleotide-by-nucleotide basis) over the window of comparison. A percentage
of sequence identity or homology is calculated by comparing two optimally aligned
sequences over the window of comparison, determining the number of positions at
which the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both sequences
to yield the number of matched positions, dividing the number of matched positions by
the total number of positions in the window of comparison (i.e., the window size), and
multiplying the result by 100 to yield the percentage of sequence homology.
Substantial homology denotes a characteristic of a polynucleotide sequence, wherein the
polynucleotide comprises a sequence having at least 60 percent sequence homology,
typically at least 70 percent homology, often 80 to 90 percent sequence homology, and
most commonly at least 99 percent sequence homology as compared to a reference
sequence over a comparison window of at least 25-50 nucleotides, wherein the percentage
of sequence homology is calculated by comparing the reference sequence to the
polynucleotide sequence, which may include deletions or additions which total 20
percent or less of the reference sequence over the window of comparison.

Sequences having sufficient homology can then be further identified by any
annotations contained in the database, including, for example, species and activity
information. Accordingly, in a typical environmental sample, a plurality of nucleic acid
sequences will be obtained, cloned and sequenced, and corresponding homologous
sequences from a database identified. This information provides a profile of the
polynucleotides present in the sample, including one or more features associated with the
polynucleotide, including the organism and activity associated with that sequence or any
polypeptide encoded by that sequence based on the database information. As used herein
"fingerprint" or "profile" refers to the fact that each sample will have associated with it a
set of polynucleotides characteristic of the sample and the environment from which it
was derived. Such a profile can include the amount and type of sequences present in the
sample, as well as information regarding the potential activities encoded by the
polynucleotides and the organisms from which the polynucleotides were derived. This
unique pattern is each sample's profile or fingerprint.
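
One possible way to tabulate such a profile is sketched below. The record layout (a dictionary per clone with "clone", "organism" and "activity" keys) and the example annotations are hypothetical and stand in for whatever fields a given database returns.

```python
from collections import Counter


def build_profile(hits):
    """Summarize database annotations of homologous hits for one sample."""
    hits = list(hits)
    return {
        "n_sequences": len(hits),
        # How many clones were attributed to each organism and activity.
        "organisms": Counter(h["organism"] for h in hits),
        "activities": Counter(h["activity"] for h in hits),
    }


if __name__ == "__main__":
    # Made-up annotations for three clones from one environmental sample.
    sample_hits = [
        {"clone": "c1", "organism": "E. coli", "activity": "beta-galactosidase"},
        {"clone": "c2", "organism": "M. jannaschii", "activity": "esterase"},
        {"clone": "c3", "organism": "E. coli", "activity": "esterase"},
    ]
    print(build_profile(sample_hits))
```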

In some instances it may be desirable to express a particular cloned
polynucleotide sequence once its identity or activity is determined or a suggested
identity or activity is associated with the polynucleotide. In such instances the desired
clone, if not already cloned into an expression vector, is ligated downstream of a
regulatory control element (e.g., a promoter or enhancer) and cloned into a suitable host
cell. Expression vectors are commercially available along with corresponding host cells
for use in the invention.

As representative examples of expression vectors which may be used there may
be mentioned viral particles, baculovirus, phage, plasmids, phagemids, cosmids,
phosmids, bacterial artificial chromosomes, viral nucleic acid (e.g., vaccinia, adenovirus,
fowl pox virus, pseudorabies and derivatives of SV40), P1-based artificial chromosomes,
yeast plasmids, yeast artificial chromosomes, and any other vectors specific for specific
hosts of interest (such as bacillus, aspergillus, yeast, etc.). Thus, for example, the DNA
may be included in any one of a variety of expression vectors for expressing a
polypeptide. Such vectors include chromosomal, nonchromosomal and synthetic DNA
sequences. Large numbers of suitable vectors are known to those of skill in the art, and
are commercially available. The following vectors are provided by way of example:
Bacterial: pQE70, pQE60, pQE-9 (Qiagen), psiX174, pBluescript SK, pBluescript KS,
pNH8A, pNH16a, pNH18A, pNH46A (Stratagene); pTRC99a, pKK223-3, pKK233-3,
pDR540, pRIT5 (Pharmacia); Eukaryotic: pWLNEO, pSV2CAT, pOG44, pXT1, pSG
(Stratagene), pSVK3, pBPV, pMSG, pSVL (Pharmacia). However, any other plasmid
or vector may be used as long as it is replicable and viable in the host.

The nucleic acid sequence in the expression vector is operatively linked to an
appropriate expression control sequence(s) (promoter) to direct mRNA synthesis.
Particular named bacterial promoters include lacI, lacZ, T3, T7, gpt, lambda PR, PL and
trp. Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early
and late SV40, LTRs from retrovirus, and mouse metallothionein-I. Selection of the
appropriate vector and promoter is well within the level of ordinary skill in the art. The
expression vector also contains a ribosome binding site for translation initiation and a
transcription terminator. The vector may also include appropriate sequences for
amplifying expression. Promoter regions can be selected from any desired gene using
CAT (chloramphenicol transferase) vectors or other vectors with selectable markers.

In addition, the expression vectors preferably contain one or more selectable 
marker genes to provide a phenotypic trait for selection of transformed host cells such as 
dihydrofolate reductase or neomycin resistance for eukaryotic cell culture, or such as 
tetracycline or ampicillin resistance in E. coli. 

The nucleic acid sequence(s) selected, cloned and sequenced as hereinabove
described can additionally be introduced into a suitable host to prepare a library which is
screened for the desired enzyme activity. The selected nucleic acid is preferably already
in a vector which includes appropriate control sequences whereby a selected nucleic acid
encoding an enzyme may be expressed, for detection of the desired activity. The host
cell can be a higher eukaryotic cell, such as a mammalian cell, or a lower eukaryotic cell,
such as a yeast cell, or the host cell can be a prokaryotic cell, such as a bacterial cell. The
selection of an appropriate host is deemed to be within the scope of those skilled in the
art from the teachings herein.

In some instances it may be desirable to perform an amplification of the nucleic
acid sequence present in a sample or a particular clone that has been isolated. In this
embodiment the nucleic acid sequence is amplified by PCR or a similar reaction
known to those of skill in the art. Amplification kits are commercially
available to carry out such amplification reactions.

In addition, it is important to recognize that the alignment algorithms and
searchable database can be implemented in computer hardware, software or a
combination thereof. Accordingly, the isolation, processing and identification of nucleic
acid sequences and the corresponding polypeptides encoded by those sequences can be
implemented in an automated system.

Without further elaboration, it is believed that one skilled in the art can, using
the preceding description, utilize the present invention to its fullest extent. The
following examples are to be considered illustrative and thus are not limiting of the
remainder of the disclosure in any way whatsoever.

Example 1 
DNA Isolation and Library Construction 

The following outlines the procedures used to generate a gene library from a
mixed population of organisms.

DNA isolation. DNA is isolated using the IsoQuick Procedure as per
manufacturer's instructions (Orca Research Inc., Bothell, WA). DNA can be
normalized according to Example 2 below. Upon isolation the DNA is sheared by
pushing and pulling the DNA through a 25 G double-hub needle and 1-cc syringes
about 500 times. A small amount is run on a 0.8% agarose gel to make sure the
majority of the DNA is in the desired size range (about 3-6 kb).

Blunt-ending DNA. The DNA is blunt-ended by mixing 45 µl of 10X Mung
Bean Buffer, 2.0 µl Mung Bean Nuclease (150 u/µl) and water to a final volume of
405 µl. The mixture is incubated at 37°C for 15 minutes. The mixture is
phenol/chloroform extracted followed by an additional chloroform extraction. One
ml of ice cold ethanol is added to the final extract to precipitate the DNA. The DNA
is precipitated for 10 minutes on ice. The DNA is removed by centrifugation in a
microcentrifuge for 30 minutes. The pellet is washed with 1 ml of 70% ethanol and
repelleted in the microcentrifuge. Following centrifugation the DNA is dried and
gently resuspended in 26 µl of TE buffer.



Methylation of DNA. The DNA is methylated by mixing 4 µl of 10X EcoR I
Methylase Buffer, 0.5 µl SAM (32 mM), 5.0 µl EcoR I Methylase (40 u/µl) and
incubating at 37°C for 1 hour. In order to ensure blunt ends, add to the methylation
reaction: 5.0 µl of 100 mM MgCl2, 8.0 µl of dNTP mix (2.5 mM of each dGTP,
dATP, dTTP, dCTP), 4.0 µl of Klenow (5 u/µl) and incubate at 12°C for 30 minutes.

After 30 minutes add 450 µl 1X STE. The mixture is phenol/chloroform
extracted once followed by an additional chloroform extraction. One ml of ice cold
ethanol is added to the final extract to precipitate the DNA. The DNA is precipitated
for 10 minutes on ice. The DNA is removed by centrifugation in a microcentrifuge
for 30 minutes. The pellet is washed with 1 ml of 70% ethanol, repelleted in the
microcentrifuge and allowed to dry for 10 minutes.

Ligation. The DNA is ligated by gently resuspending the DNA in 8 µl EcoR
I adaptors (from Stratagene's cDNA Synthesis Kit), 1.0 µl of 10X Ligation Buffer, 1.0
µl of 10 mM rATP, 1.0 µl of T4 DNA Ligase (4 Wu/µl) and incubating at 4°C for 2
days. The ligation reaction is terminated by heating for 30 minutes at 70°C.

Phosphorylation of adaptors. The adaptor ends are phosphorylated by
mixing the ligation reaction with 1.0 µl of 10X Ligation Buffer, 2.0 µl of 10 mM
rATP, 6.0 µl of H2O, 1.0 µl of polynucleotide kinase (PNK) and incubating at 37°C
for 30 minutes. After 30 minutes 31 µl H2O and 5 ml 10X STE are added to the
reaction and the sample is size fractionated on a Sephacryl S-500 spin column. The
pooled fractions (1-3) are phenol/chloroform extracted once followed by an additional
chloroform extraction. The DNA is precipitated by the addition of ice cold ethanol on
ice for 10 minutes. The precipitate is pelleted by centrifugation in a microfuge at high
speed for 30 minutes. The resulting pellet is washed with 1 ml 70% ethanol,
repelleted by centrifugation and allowed to dry for 10 minutes. The sample is
resuspended in 10.5 µl TE buffer. Do not plate. Instead, ligate directly to lambda
arms as above except use 2.5 µl of DNA and no water.

Sucrose Gradient (2.2 ml) Size Fractionation. Stop ligation by heating the
sample to 65°C for 10 minutes. Gently load sample on 2.2 ml sucrose gradient and
centrifuge in mini-ultracentrifuge at 45K, 20°C for 4 hours (no brake). Collect
fractions by puncturing the bottom of the gradient tube with a 20G needle and
allowing the sucrose to flow through the needle. Collect the first 20 drops in a Falcon
2059 tube then collect 10 1-drop fractions (labeled 1-10). Each drop is about 60 µl in
volume. Run 5 µl of each fraction on a 0.8% agarose gel to check the size. Pool
fractions 1-4 (about 10-1.5 kb) and, in a separate tube, pool fractions 5-7 (about 5-0.5
kb). Add 1 ml ice cold ethanol to precipitate and place on ice for 10 minutes. Pellet
the precipitate by centrifugation in a microfuge at high speed for 30 minutes. Wash
the pellets by resuspending them in 1 ml 70% ethanol and repelleting them by
centrifugation in a microfuge at high speed for 10 minutes and dry. Resuspend each
pellet in 10 µl of TE buffer.

Test Ligation to Lambda Arms. Plate assay by spotting 0.5 µl of the sample
on agarose containing ethidium bromide along with standards (DNA samples of
known concentration) to get an approximate concentration. View the samples using
UV light and estimate concentration compared to the standards. Fraction 1-4 = >1.0
µg/µl. Fraction 5-7 = 500 ng/µl.

Prepare the following ligation reactions (5 µl reactions) and incubate at 4°C overnight:

Sample         H2O      10X Ligase   10 mM    Lambda arms   Insert   T4 DNA Ligase
                        Buffer       rATP     (ZAP)         DNA      (4 Wu/µl)
Fraction 1-4   0.5 µl   0.5 µl       0.5 µl   1.0 µl        2.0 µl   0.5 µl
Fraction 5-7   0.5 µl   0.5 µl       0.5 µl   1.0 µl        2.0 µl   0.5 µl

Test Package and Plate. Package the ligation reactions following
manufacturer's protocol. Stop packaging reactions with 500 µl SM buffer and pool
packaging that came from the same ligation. Titer 1.0 µl of each pooled reaction on
appropriate host (OD600 = 1.0) [XL1-Blue MRF']. Add 200 µl host (in 10 mM MgSO4) to
Falcon 2059 tubes, inoculate with 1 µl packaged phage and incubate at 37°C for 15
minutes. Add about 3 ml 48°C top agar [50 ml stock containing 150 µl IPTG (0.5M)
and 300 µl X-GAL (350 mg/ml)] and plate on 100 mm plates. Incubate the plates at
37°C, overnight.



Amplification of Libraries (5.0 x 10^5 recombinants from each library).
Add 3.0 ml host cells (OD600 = 1.0) to two 50 ml conical tubes and inoculate with 2.5 x
10^5 pfu of phage per conical tube. Incubate at 37°C for 20 minutes. Add top agar to
each tube to a final volume of 45 ml. Plate each tube across five 150 mm plates.
Incubate the plates at 37°C for 6-8 hours or until plaques are about pin-head in size.
Overlay the plates with 8-10 ml SM Buffer and place at 4°C overnight (with gentle
rocking if possible).

Harvest Phage. Recover phage suspension by pouring the SM buffer off each
plate into a 50-ml conical tube. Add 3 ml of chloroform, shake vigorously and
incubate at room temperature for 15 minutes. Centrifuge the tubes at 2K rpm for 10
minutes to remove cell debris. Pour supernatant into a sterile flask, add 500 µl
chloroform and store at 4°C.

Titer Amplified Library. Make serial dilutions of the harvested phage (for
example, 10^-3 = 1 µl amplified phage in 1 ml SM Buffer; 10^-6 = 1 µl of the 10^-3 dilution
in 1 ml SM Buffer). Add 200 µl host (in 10 mM MgSO4) to two tubes. Inoculate one
tube with 10 µl 10^-6 dilution (10^-5). Inoculate the other tube with 1 µl 10^-6 dilution
(10^-6). Incubate at 37°C for 15 minutes. Add about 3 ml 48°C top agar [50 ml stock
containing 150 µl IPTG (0.5M) and 375 µl X-GAL (350 mg/ml)] to each tube and
plate on 100 mm plates. Incubate the plates at 37°C, overnight. Excise the ZAP II
library to create the pBLUESCRIPT library according to manufacturer's protocols
(Stratagene).
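
The titer implied by a plaque count can be worked out as sketched below, using the usual relationship titer = plaques / (dilution x volume plated). The plaque counts in the usage lines are hypothetical, and the function name is an assumption for this example.

```python
def titer_pfu_per_ml(plaques, dilution, volume_plated_ul):
    """Back-calculate the titer of the undiluted stock in pfu/ml.

    plaques          -- plaques counted on the plate
    dilution         -- dilution factor of the plated sample (e.g. 1e-6)
    volume_plated_ul -- volume of diluted phage added to the host, in µl
    """
    if plaques < 0 or dilution <= 0 or volume_plated_ul <= 0:
        raise ValueError("counts must be >= 0; dilution and volume must be > 0")
    volume_ml = volume_plated_ul / 1000.0
    return plaques / (dilution * volume_ml)


if __name__ == "__main__":
    # Hypothetical counts: 50 plaques from 1 µl and 500 plaques from 10 µl
    # of the 10^-6 dilution should both point to the same stock titer.
    print(titer_pfu_per_ml(50, 1e-6, 1.0))
    print(titer_pfu_per_ml(500, 1e-6, 10.0))
```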

Example 2 

Normalization

Prior to library generation, purified DNA can be normalized. DNA is first
fractionated according to the following protocol. A sample composed of genomic
DNA is purified on a cesium-chloride gradient. The cesium chloride (Rf = 1.3980)
solution is filtered through a 0.2 µm filter and 15 ml is loaded into a 35 ml OptiSeal
tube (Beckman). The DNA is added and thoroughly mixed. Ten micrograms of
bis-benzimide (Sigma; Hoechst 33258) is added and mixed thoroughly. The tube is
then filled with the filtered cesium chloride solution and spun in a VTi50 rotor in a
Beckman L8-70 Ultracentrifuge at 33,000 rpm for 72 hours. Following
centrifugation, a syringe pump and fractionator (Brandel Model 186) are used to drive
the gradient through an ISCO UA-5 UV absorbance detector set to 280 nm. Peaks
representing the DNA from the organisms present in an environmental sample are
obtained. Eubacterial sequences can be detected by PCR amplification of DNA
encoding rRNA from a 10-fold dilution of the E. coli peak using the following
primers to amplify:

Forward primer: 5'-AGAGTTTGATCCTGGCTCAG-3' (SEQ ID NO:2)
Reverse primer: 5'-GGTTACCTTGTTACGACTT-3' (SEQ ID NO:3)

Recovered DNA is sheared or enzymatically digested to 3-6 kb fragments.
Lone-linker primers are ligated and the DNA is size selected. Size-selected DNA is
amplified by PCR, if necessary.



Normalization is then accomplished as follows by resuspending the
double-stranded DNA sample in hybridization buffer (0.12 M NaH2PO4, pH 6.8/0.82
M NaCl/1 mM EDTA/0.1% SDS). The sample is overlaid with mineral oil and
denatured by boiling for 10 minutes. The sample is incubated at 68°C for 12-36 hours.
Double-stranded DNA is separated from single-stranded DNA according to standard
protocols (Sambrook, 1989) on hydroxyapatite at 60°C. The single-stranded DNA
fraction is desalted and amplified by PCR. The process is repeated for several more
rounds (up to 5 or more).
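
Why retaining the single-stranded fraction evens out abundances can be illustrated with an idealized model. The sketch below assumes simple second-order reassociation kinetics, in which a species at initial concentration C0 has C0 / (1 + k*C0*t) remaining single-stranded after time t; this model, the rate constant k and the starting abundances are assumptions for illustration and are not taken from the protocol above.

```python
def single_stranded_remaining(c0, k, t):
    """Single-stranded concentration left after time t (idealized model)."""
    return c0 / (1.0 + k * c0 * t)


def normalization_round(abundances, k=1.0, t=1.0):
    """Apply one denature/reanneal round and keep the single-stranded DNA.

    Re-amplification by PCR is not modeled; only relative abundances matter.
    """
    return {
        name: single_stranded_remaining(c0, k, t)
        for name, c0 in abundances.items()
    }


if __name__ == "__main__":
    # Hypothetical sample: one genome 100-fold more abundant than another.
    pool = {"abundant": 100.0, "rare": 1.0}
    for round_number in range(1, 4):
        pool = normalization_round(pool)
        print(round_number, pool["abundant"] / pool["rare"])
```

Under these assumptions the abundant species reanneals faster and is preferentially removed with the double-stranded fraction, so the ratio between species shrinks with each round.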



Example 3

FACS/Biopanning

Infection of library lysates into Exp503 E. coli strain. A 25 ml LB + Tet
culture of Exp503 was cultured overnight at 37°C. The next day the culture was
centrifuged at 4000 rpm for 10 minutes and the supernatant decanted. 20 ml 10 mM
MgSO4 was added and the OD600 checked. Dilute to OD 1.0.

In order to obtain a good representation of the library, at least 2-fold (and
preferably 5-fold) of the library lysate titer was used. For example: Titer of library
lysate is 2x10^6 cfu/ml. Need to plate at least 4x10^6 cfu. Can plate approx. 500,000
microcolonies/150 mm LB-Kan plate. Need 8 plates. Can plate 1 ml of reaction/plate;
need 8 ml of cells + lysate.
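
The plating arithmetic in the example above can be reproduced with a short calculation. The function and parameter names are hypothetical; the default values are the figures stated in the text.

```python
import math


def plating_plan(library_titer_cfu, fold=2, colonies_per_plate=500_000,
                 ml_per_plate=1.0):
    """Return (cfu to plate, number of plates, ml of cells + lysate)."""
    cfu_needed = library_titer_cfu * fold
    # Round up so that the requested coverage is always reached.
    plates = math.ceil(cfu_needed / colonies_per_plate)
    return cfu_needed, plates, plates * ml_per_plate


if __name__ == "__main__":
    print(plating_plan(2_000_000))          # 2-fold coverage, as in the text
    print(plating_plan(2_000_000, fold=5))  # preferred 5-fold coverage
```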

2-fold (e.g., 2 ml) of library lysate was mixed with an appropriate amount (e.g., 6
ml) of OD 1.0 Exp503. The sample was incubated at 37°C for at least 1 hour. Plated
1 ml reaction on 150 mm LB-Kan plate x 8 plates and incubated overnight at 30°C.

Harvesting, induction, and fixing of library in Exp503 cells. Scrape all
cells from plates into 20 ml LB using a rubber policeman. Dilute cells approx. 1:100
(200 µl cells/20 ml LB) and incubate at 37°C until culture is OD 0.3. Add 1:50
dilution of 20% sterile glucose and incubate at 37°C until culture is OD 1.0. Add
1:100 dilution of 1M MgSO4. Transfer 5 ml of culture to a fresh tube and the
remaining culture can be used as an uninduced control if desired or discarded. Add
MOI 5 of CE6 bacteriophage to the remaining 5 ml of culture. (CE6 codes for T7
RNA Polymerase) (e.g., OD 1 = 8x10^8 cells/ml x 5 ml = 4x10^9 cells x MOI 5 =
2x10^10 bacteriophage needed). Incubate culture + CE6 for 2 hr at 37°C. Cool on ice
and centrifuge cells at 4000 rpm for 10 min. Wash with 10 ml PBS. Fix cells in 600
µl PBS + 1.8 ml fresh, filtered 4% paraformaldehyde. Incubate on ice for 2 hrs. (4%
Paraformaldehyde: Heat 8.25 ml PBS in flask at 65°C. Add 100 µl 1M NaOH and
0.5 g paraformaldehyde (stored at 4°C). Mix until dissolved. Add 4.15 ml PBS. Cool
to 0°C. Adjust pH to 7.2 with 0.5 M NaH2PO4. Cool to 0°C. Syringe filter. Use within
24 hrs). After fixing, centrifuge at 4000 rpm for 10 min. Resuspend in 1.8 ml PBS
and 200 µl 0.1% NP40. Store at 4°C overnight.
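
The amount of CE6 bacteriophage required for a given multiplicity of infection follows from the cell count, as in the worked figures above. In the sketch below the conversion of 8x10^8 cells/ml at OD 1 is the value given in the text; the function name is an assumption.

```python
def phage_needed(od600, volume_ml, moi, cells_per_ml_at_od1=8e8):
    """Return (total cells, phage particles needed) for the requested MOI."""
    cells = od600 * cells_per_ml_at_od1 * volume_ml
    return cells, cells * moi


if __name__ == "__main__":
    cells, phage = phage_needed(od600=1.0, volume_ml=5.0, moi=5)
    print(f"{cells:.1e} cells, {phage:.1e} phage")
```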

Hybridization of fixed cells. Centrifuge fixed cells at 4000 rpm for 10 min.
Resuspend in 1 ml 40 mM Tris pH 7.6/0.2% NP40. Transfer 100 µl fixed cells to an
Eppendorf tube. Centrifuge for 1 min and remove supernatant. Resuspend each
reaction in 50 µl Hybridization buffer (0.9 M NaCl; 20 mM Tris pH 7.4; 0.01% SDS;
25% formamide; can be made in advance and stored at -20°C). Add 0.5 nmol
fluorescein-labeled primer to the appropriate reactions. Incubate with rocking at 46°C
for 2 hr. (Hybridization temperature may depend on sequence of primer and
template.) Add 1 ml wash buffer to each reaction, rinse briefly and centrifuge for 1
min. Discard supernatant. (Wash buffer: 0.9 M NaCl; 20 mM Tris pH 7.4; 0.01%
SDS). Add another 1 ml of wash buffer to each reaction, and incubate at 48°C with
rocking for 30 min. Centrifuge and remove supernatant. Visualize cells under
microscope using WIB filter.

FACS sorting. Dilute cells in 1 ml PBS. If cells are clumping, sonicate for 20
seconds at 1.5 power. FACS sort the most highly fluorescent single cells and collect in
0.5 ml PCR strip tubes (approximately one 96-well plate/library). PCR single cells
with vector-specific primers to amplify the insert in each cell. Electrophorese all
samples on an agarose gel and select samples with single inserts. These can be
re-amplified with biotin-labeled primers, hybridized to insert-specific primers, and
examined in an ELISA assay. Positive clones can then be sequenced. Alternatively,
the selected samples can be re-amplified with various combinations of insert-specific
primers, or sequenced directly.

Example 4

Cell Staining Prior to FACS Screening

Gene libraries, including those generated as described in Examples 1 and 3, can
be further screened for bioactivities of interest on a FACS machine as indicated
herein. A screening process begins with staining of the cells with a desirable
substrate according to the following example.

A gene library is made from the hyperthermophilic archaeon Sulfolobus
solfataricus in the λ-ZAPII vector according to the manufacturer's instructions
(Stratagene Cloning Systems, Inc., La Jolla, CA), and excised into the
pBLUESCRIPT plasmid according to the manufacturer's instructions (Stratagene).
DNA was isolated using the IsoQuick DNA isolation kit according to the
manufacturer's instructions (Orca, Inc., Bothell, WA).

To screen for β-galactosidase activity, cells are stained as follows. Cells are
cultivated overnight at 37°C in an orbital shaker at 250 rpm. Cells are centrifuged to
collect about 2x10^7 cells (0.1 ml of the culture), resuspended in 1 ml of deionized
water, and stained with C12-fluorescein-di-β-D-galactopyranoside (FDG). Briefly,
0.5 ml of cells are mixed with 50 µl C12-FDG staining solution (1 mg C12-FDG in 1 ml
of a mixture of 98% H2O, 1% DMSO, 1% EtOH) and 50 µl propidium iodide (PI)
staining solution (50 µg/ml of distilled water). The sample is incubated in the dark at
37°C with shaking at 150 rpm for 30 minutes. Cells are then heated to 70°C for 30
minutes (this step can be avoided if sample is not derived from a hyperthermophilic
organism).



Example 5

Screening of Expression Libraries by FACS and Recovery of Genetic
Information of Sorted Organisms

The excised λ-ZAP II library is incubated for 2 hours and induced with IPTG.
Cells are centrifuged, washed and stained with the desired enzyme substrate, for
example C12-fluorescein-di-β-D-galactopyranoside (FDG) as in Example 3. Clones
are sorted on a commercially available FACS machine, and positives are collected.
Cells are lysed according to standard techniques (Current Protocols in Molecular
Biology, 1987) and plasmids are transformed into a new host by electroporation using
standard techniques. Transformed cells are plated for secondary screening. The
procedure is illustrated in Figure 1. Sorted organisms can be grown and plated for
secondary screening.

Example 6
Sorting Directly on Microtiter Plates

Cells can be sorted in a FACS instrument directly on microtiter plates in
accordance with the present invention. Sorting in this fashion facilitates downstream
processing of positive clones.

E. coli cells containing β-galactosidase genes are exposed to a staining
solution. These cells are then left to sit on ice for three minutes. For the cell sorting
procedure they are diluted 1:100 in deionized water or in Phosphate Buffered Saline
solution according to the manufacturer's protocols for cell sorting. The cells are then
sorted by the FACS instrument into microtiter plates, one cell per well. The sorting
criterion is fluorescein fluorescence, indicating β-galactosidase activity, or PI,
indicating the staining of dead cells (unlike viable cells, dead cells have no membrane
potential; hence PI remains in dead cells and is pumped out of live
cells).
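
A two-parameter sort gate of the kind described can be sketched as follows. The threshold values and the function name are hypothetical placeholders; actual gates would be set on the instrument from stained and unstained control populations.

```python
def classify_event(fluorescein, pi, fluorescein_min=1000.0, pi_max=200.0):
    """Classify one FACS event from its two fluorescence intensities.

    Thresholds are placeholder values in arbitrary instrument units.
    """
    if pi > pi_max:
        return "dead"       # PI is retained by cells lacking membrane potential
    if fluorescein >= fluorescein_min:
        return "positive"   # fluorescein signal indicates enzyme activity
    return "negative"


if __name__ == "__main__":
    # Hypothetical (fluorescein, PI) readings for three events.
    for event in [(2500.0, 50.0), (2500.0, 900.0), (120.0, 40.0)]:
        print(event, classify_event(*event))
```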

Table 1

Habitat                       Cultured (%)
Seawater                      0.001-0.1
Freshwater                    0.25
Mesotrophic lake              0.01-1.0
Unpolluted estuarine waters   0.1-3.0
Activated sludge              1.0-15.0
Sediments                     0.25
Soil                          0.3

Example 6 

Production of single cells or fragmented mycelia

Inoculate 25 ml MYME media (see recipe below) in a 250 ml baffled flask with
100 µl of Streptomyces 10712 spore suspension and incubate overnight at 30°C,
250 rpm. After 24 hour incubation, transfer 10 ml to a 50 ml conical polypropylene
centrifuge tube and centrifuge at 4,000 rpm for 10 minutes at 25°C. Decant
supernatant and resuspend pellet in 10 ml 0.05M TES buffer. Sort cells onto MYM
agar plates (sort 1 cell per drop, 5 cells per drop, 10 cells per drop) and incubate plates
at 30°C.

MYME media (Yang et al., 1995, J. Bacteriol. 177(21): 6111-6117) contains:
10.3% sucrose, 1% maltose, 0.5% peptone, 0.3% yeast extract, 0.3% malt extract,
5 mM MgCl2 and 1% glycine.

While the invention has been described in detail with reference to certain
preferred embodiments thereof, it will be understood that modifications and variations
are within the spirit and scope of that which is described and claimed.